[CVPR 2025]
[Project Page] [Paper] [Supp] [Dataset]
We introduce a new task, Motion-Grounded Video Reasoning, in which models must answer motion-related questions with spatiotemporal segmentation masks as the visual response.
This task addresses key limitations in prior video understanding research by introducing:
- Implicit question-based reasoning
- Motion-aware temporal localization
- Object-level visual grounding
- Pixel-level mask generation across time
- Four question types: Causal, Sequential, Counterfactual, and Descriptive
Figure 1: GROUNDMORE fills the gap between referring segmentation, temporal grounding, and reasoning by combining implicit QA with visual spatiotemporal output.
The Motion-Grounded Video Reasoning task is defined by the following input and output:

Input:
- A video clip $V \in \mathbb{R}^{t \times h \times w \times 3}$
- A motion-related question $Q$

Output:
- Spatiotemporal segmentation masks $M \in \mathbb{R}^{t' \times h \times w}$ highlighting the target object
This output represents the reasoning result visually by grounding the answer over space and time.
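For concreteness, the snippet below sketches this input/output contract in PyTorch. It is illustrative only: the clip length, resolution, and question are made up, and the tensor names are not part of any released API.

```python
import torch

# Illustrative shapes for the Motion-Grounded Video Reasoning task.
# Input: a video clip V with t frames of size h x w (3 channels)
#        and a motion-related question Q given as free-form text.
# Output: spatiotemporal masks M over the t' frames (t' <= t) in which
#         the queried motion occurs, highlighting the target object.

t, h, w = 32, 480, 854                 # example clip length and resolution
V = torch.rand(t, h, w, 3)             # V in R^{t x h x w x 3}
Q = "What does the boy do after picking up the ball?"

t_prime = 12                           # length of the localized motion segment
M = torch.zeros(t_prime, h, w)         # M in R^{t' x h x w}, one mask per grounded frame
```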
We collect a new benchmark dataset: GROUNDMORE, designed to evaluate fine-grained motion reasoning.
- 1.7K high-resolution video clips
- 7.6K question-answer pairs
- 249K object-level spatiotemporal masks
- Diverse video categories: family scene, animal, ball game, and outdoor activity
Table 1: Motion-Grounded Video Reasoning supports all dimensions: spatial & temporal context, motion abstraction, pixel-level output, and implicit reasoning.
Table 2: GROUNDMORE contains more dense QA + segmentation annotations than prior benchmarks, especially in motion-related reasoning.
We propose a baseline model called MoRA, built for this task. It integrates:
- LLaVA for multimodal reasoning
- SAM decoder for spatial mask decoding
- [SEG] token for object semantic embedding
- [LOC] token for temporal localization of motion events
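The sketch below is a schematic of this design, not the released implementation: it shows one way the hidden states at the [SEG] and [LOC] token positions could be projected into a prompt for a SAM-style mask decoder and into a temporal localization head. All module names, dimensions, and the mask_decoder callable are assumptions.

```python
import torch
import torch.nn as nn

class MoRASketch(nn.Module):
    """Schematic of the MoRA idea (illustrative; not the released code):
    an LLaVA-style multimodal LLM emits special [SEG] and [LOC] tokens,
    whose hidden states drive mask decoding and temporal localization."""

    def __init__(self, llm_hidden_dim=4096, prompt_dim=256):
        super().__init__()
        self.seg_proj = nn.Linear(llm_hidden_dim, prompt_dim)  # [SEG] -> mask-decoder prompt
        self.loc_proj = nn.Linear(llm_hidden_dim, prompt_dim)  # [LOC] -> temporal head input
        self.temporal_head = nn.Linear(prompt_dim, 2)          # predicts normalized (start, end)

    def forward(self, seg_hidden, loc_hidden, frame_features, mask_decoder):
        # seg_hidden / loc_hidden: LLM hidden states at the [SEG] / [LOC] positions, (B, D).
        seg_prompt = self.seg_proj(seg_hidden)                 # (B, prompt_dim)
        loc_embed = self.loc_proj(loc_hidden)                  # (B, prompt_dim)

        # Temporal localization: where the queried motion happens in the clip.
        start_end = torch.sigmoid(self.temporal_head(loc_embed))  # (B, 2)

        # Spatial grounding: decode a mask per frame, prompted by the [SEG] embedding.
        masks = torch.stack([mask_decoder(f, seg_prompt) for f in frame_features], dim=1)
        return masks, start_end
```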
Figure 3: MoRA outputs pixel-level segmentation masks as the response to the input motion-related question.
Table 3: MoRA achieves SOTA on all question types, outperforming previous baseline models.
Table 5: Temporal localization via [LOC] token significantly improves performance.
git clone https://github.com/groundmore/GROUNDMORE.git
cd GROUNDMORE
conda create -n groundmore python=3.10
conda activate groundmore
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

GroundMoRe is available at: https://huggingface.co/datasets/groundmore/GroundMoRe
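Below is a minimal sketch for pulling the raw dataset files from the Hugging Face Hub with huggingface_hub; the on-disk layout of the videos, QA pairs, and mask annotations is defined by the dataset card, not by this snippet.

```python
from huggingface_hub import snapshot_download

# Download the GroundMoRe files from the Hugging Face Hub into a local directory.
# The structure of clips, questions, and masks follows the dataset card.
local_dir = snapshot_download(repo_id="groundmore/GroundMoRe", repo_type="dataset")
print("GroundMoRe downloaded to:", local_dir)
```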
Before training, you need to obtain LISA and SAM for model initialization.
Put the SAM pretrained weights under ./pretrain_weights/.
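Below is a minimal sketch for fetching a SAM checkpoint into that directory. It assumes the ViT-H checkpoint (sam_vit_h_4b8939.pth) is the one required; confirm the expected variant against the training config, and note that the LISA weights still need to be obtained separately.

```python
import os
import urllib.request

# Fetch a SAM checkpoint into ./pretrain_weights/ (assumes the ViT-H variant).
os.makedirs("./pretrain_weights", exist_ok=True)
url = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
dest = os.path.join("./pretrain_weights", os.path.basename(url))
if not os.path.exists(dest):
    urllib.request.urlretrieve(url, dest)
print("SAM weights at:", dest)
```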
We use the Refer-YouTube-VOS and MeViS datasets for zero-shot training.
bash run.sh
python evaluate_groundmore.py

- Release MoRA-FT-LISA7B
- Release MoRA-ZS-LISA13B
- Release MoRA-FT-LISA13B
If this work is useful for your research, please cite:
@inproceedings{deng2025groundmore,
title={Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level},
author={Deng, Andong and Chen, Tongjia and Yu, Shoubin and Yang, Taojiannan and Spencer, Lincoln and Tian, Yapeng and Mian, Ajmal Saeed and Bansal, Mohit and Chen, Chen},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}

This work is built upon LISA and SAM.
We also appreciate the valuable help from Wenshuo Chen and Erhang Zhang during the GroundMoRe data collection.





