We propose UniVG-R1, a reasoning-guided MLLM for universal visual grounding that employs GRPO with a cold-start initialization to effectively enhance reasoning capabilities across multimodal contexts.
A high-quality CoT dataset is introduced, encompassing diverse tasks, each meticulously annotated with detailed reasoning chains to facilitate advanced reasoning-based grounding.
We identify a difficulty bias in GRPO training and propose a difficulty-aware weight adjustment strategy. Experiments validate that GRPO equipped with this strategy consistently enhances model performance.
Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple grounding benchmarks, showcasing its versatility and generalizability.
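To make the difficulty-aware idea concrete, here is a minimal sketch of group-relative advantage computation with a difficulty-dependent weight. This is an illustration only: the difficulty proxy (per-group success rate) and the weight form `(1 - accuracy) ** alpha` are our assumptions for exposition, not the exact formulation from the paper.

```python
import numpy as np

def grpo_advantages_with_difficulty(rewards, alpha=1.0):
    """Group-relative advantages with a hypothetical difficulty-aware weight.

    `rewards`: per-rollout rewards for a single query (one GRPO group).
    The difficulty proxy and the weight `(1 - accuracy) ** alpha` are
    illustrative assumptions, not the paper's exact formulation.
    """
    r = np.asarray(rewards, dtype=float)
    # Standard GRPO: normalize rewards within the group.
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # Difficulty proxy: fraction of rollouts that earned a positive reward.
    accuracy = (r > 0).mean()
    # Up-weight hard samples (low accuracy), down-weight easy ones.
    weight = (1.0 - accuracy) ** alpha
    return weight * adv
```

Under this toy weighting, groups the model already solves reliably contribute smaller gradients than groups it mostly fails, which counteracts the bias toward easy samples.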
🔥 Demo
We provide an online demo on 🤗Huggingface for convenience.
If you want to test locally, you can use the demo.py.
🛠️ Installation
Our code is based on VLM-R1; please follow their instructions to set up the environment.
For evaluation on MIG-Bench, please follow the Migician instructions to set up the environment. We provide the UniVG-R1 evaluation scripts in the eval folder; replace the corresponding files in the Migician codebase with them.
For zero-shot evaluation, please first download the original data (LISA, LLMSeg, ReVOS, and ReasonVOS), then download our extracted bounding-box annotations here.
📈 Results
Performance on the MIG-Bench.
Zero-shot performance on several reasoning grounding benchmarks.
✏️ Citation
If you find our paper and code helpful for your research, please consider starring our repository ⭐ and citing our work ✏️.
@article{bai2025univg,
  title={UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning},
  author={Bai, Sule and Li, Mingxing and Liu, Yong and Tang, Jing and Zhang, Haoji and Sun, Lei and Chu, Xiangxiang and Tang, Yansong},
  journal={arXiv preprint arXiv:2505.14231},
  year={2025}
}