During the stage 2 reinforcement learning process, we observe that most samples progressively become easier for the model, with the proportion of easy samples increasing and the proportion of hard samples steadily decreasing. Since the GRPO algorithm normalizes rewards to compute the relative advantage within each group, an easy sample (e.g., \(\textit{mIoU}\) = 0.8) receives a policy gradient update of the same magnitude as a hard sample (e.g., \(\textit{mIoU}\) = 0.2). This leads to a difficulty-bias issue: during the later stages of training, as easy samples become predominant, most updates are derived from these easier instances, making it difficult for the model to focus on hard samples.
To address this problem, we propose a difficulty-aware weight adjustment strategy that dynamically adjusts the weight of each sample based on its difficulty. Specifically, we introduce a difficulty coefficient \( \phi \propto -\textit{mIoU} \) to quantify the difficulty level of each sample, where \( \phi \) is negatively correlated with \(\textit{mIoU}\). The coefficient is computed from the average accuracy reward (\(\textit{mIoU}\)) over the responses sampled for each sample and is used to reweight that sample's contribution to the policy update. The detailed formula is provided below.
\[
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)} \left[
\frac{1}{G}\sum_{i=1}^G {\color{blue} \phi(\mathit{mIoU})} \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}A_i - \beta\mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref})
\right]
\]
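To make the weighting concrete, the following is a minimal PyTorch sketch of the difficulty-weighted objective for a single question with \(G\) sampled responses. The linear form \( \phi = 1 - \overline{\textit{mIoU}} \), the k3-style KL estimator, and all tensor names are illustrative assumptions; the text above only requires \( \phi \) to be negatively correlated with the sample's average \(\textit{mIoU}\).

\begin{verbatim}
import torch

def difficulty_coefficient(miou_group: torch.Tensor) -> torch.Tensor:
    """Map a sample's group-average mIoU to a difficulty weight.

    The method only requires phi to be negatively correlated with mIoU
    (phi ∝ -mIoU); the linear form `1 - mean(mIoU)` here is an assumed
    instantiation for illustration.
    """
    return 1.0 - miou_group.mean()

def grpo_objective(logp_new, logp_old, logp_ref, miou, beta=0.04):
    """Difficulty-weighted GRPO loss for one question q with G responses.

    Args (all tensors of shape [G]):
        logp_new: log pi_theta(o_i | q) under the current policy.
        logp_old: log pi_theta_old(o_i | q) under the sampling policy.
        logp_ref: log pi_ref(o_i | q) under the frozen reference policy.
        miou:     per-response accuracy reward (mIoU).
    """
    # Group-relative advantage: normalize rewards within the group (GRPO).
    A = (miou - miou.mean()) / (miou.std() + 1e-8)

    # Importance ratio between current and old policy.
    ratio = torch.exp(logp_new - logp_old.detach())

    # Difficulty coefficient scales the sample's whole policy-gradient term.
    phi = difficulty_coefficient(miou)

    # k3-style estimate of D_KL(pi_theta || pi_ref); the exact estimator
    # is an assumption, not specified in the text.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    # Maximize the objective, so return its negative as the loss.
    objective = (phi * ratio * A - beta * kl).mean()
    return -objective

# Toy usage with random log-probabilities and rewards for G = 8 responses.
G = 8
logp_new = torch.randn(G, requires_grad=True)
logp_old = logp_new.detach() + 0.01 * torch.randn(G)
logp_ref = logp_new.detach() + 0.05 * torch.randn(G)
miou = torch.rand(G)
loss = grpo_objective(logp_new, logp_old, logp_ref, miou)
loss.backward()
\end{verbatim}

Because \( \phi \) multiplies the entire group term, hard samples (low average \(\textit{mIoU}\), large \( \phi \)) contribute proportionally larger gradients than easy ones, counteracting the difficulty bias described above.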