Ye Liu1†, Kevin Qinghong Lin2†, Chang Wen Chen1, Mike Zheng Shou2
1The Hong Kong Polytechnic University 2Show Lab, National University of Singapore
TL;DR: Pioneering DeepSearch-like video understanding.
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes such as breaking down tasks, localizing and verifying moments, and synthesizing answers. This approach addresses the unique challenges of temporally grounded reasoning through a progressive strategy.
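For intuition, here is a minimal conceptual sketch of such a progressive pipeline. The role names (planner, grounder, verifier, answerer) follow our paper, but the function signatures below are illustrative only and do not correspond to the actual codebase, which routes the roles through LoRA adapters on a single backbone.

```python
# Conceptual sketch only -- illustrative function names, not the real API.
def answer_with_grounding(video, question, planner, grounder, verifier, answerer):
    # 1. Planner: decompose the task and decide which roles are needed.
    plan = planner(video, question)             # e.g. ["ground", "verify", "answer"]

    segment = None
    if "ground" in plan:
        # 2. Grounder: localize candidate moments relevant to the question.
        candidates = grounder(video, question)  # e.g. [(12.0, 34.5), (50.1, 62.0)]
        if "verify" in plan and candidates:
            # 3. Verifier: score each candidate and keep the most reliable one.
            scores = [verifier(video, question, span) for span in candidates]
            segment = candidates[scores.index(max(scores))]
        elif candidates:
            segment = candidates[0]

    # 4. Answerer: answer over the selected moment, or the full video if none was found.
    return answerer(video, question, segment)
```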
News:

- 2025.04.05: See BENCHMARK.md for evaluation results of VideoMind on public benchmarks.
- 2025.03.28: VideoMind-2B is ready on Hugging Face Spaces. Check it out!
- 2025.03.21: Code, model, and dataset released.
- 2025.03.17: Our tech report is available online.
| Benchmark | Evaluation Results (2B/7B) |
|---|---|
| ZS CG-Bench (mini) | long-acc: 31.0/38.4, rec@IoU: 8.50/9.93, acc@IoU: 4.02/4.67 |
| ZS ReXTime (val) | mIoU: 24.83/27.61, Acc: 69.06/74.59, Acc@IoU: 17.26/20.20 |
| ZS NExT-GQA (test) | mIoU: 28.6/31.4, mIoP: 36.4/39.0, Acc@GQA: 25.2/28.2 |
| ZS DeVE-QA (val)* | mIoU: 26.3/30.1, mIoP: 49.9/51.9, Acc@GQA: 41.2/44.2 |
| ZS Charades-STA (test) | R@0.5: 51.1/59.1, R@0.7: 26.0/31.2, mIoU: 45.2/50.2 |
| ZS ActivityNet-Captions (val_2) | R@0.5: 26.5/30.3, R@0.7: 12.6/15.7, mIoU: 30.1/33.3 |
| FT QVHighlights (test) | R@0.5: 75.42/78.53, R@0.7: 59.35/61.09, mAP: 51.60/54.19 |
| FT TACoS (test) | R@0.5: 26.9/36.2, R@0.7: 15.5/21.4, mIoU: 27.4/34.4 |
| ZS Ego4D-NLQ (val) | R@0.5: 2.9/3.7, R@0.7: 1.2/1.7, mIoU: 4.7/5.4 |
| ZS ActivityNet-RTL (val) | P@0.5: 20.1/28.0, mIoU: 22.7/31.3 |
| ZS Video-MME (w/o subs) | All: 55.4/58.2, Long: 46.3/49.2 |
| ZS MLVU | M-Avg: 58.7/64.4 |
| ZS LVBench | Overall: 35.4/40.8 |
| ZS MVBench | Acc: 62.5/64.6 |
| ZS LongVideoBench | Acc: 48.8/56.3 |
ZS and FT refer to zero-shot and fine-tuned settings, respectively. * means third-party results.
See BENCHMARK.md for full evaluation results.
Play with our online demo or see DEMO.md for guidelines on deploying it locally.
We provide raw videos, compressed videos, and pre-processed annotations for 27 video grounding / QA datasets, including our VideoMind-SFT (481K samples) for training and multiple benchmarks for evaluation. We also release the datasets used during our early exploration (but not included in the final version) to facilitate future research.
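If you prefer fetching the data programmatically, the sketch below uses `snapshot_download` from `huggingface_hub`; the repository ID and folder name are placeholders and should be replaced with the actual repository and the folder names listed in the Processed column below.

```python
# Sketch for downloading one processed subset of the dataset.
# NOTE: repo_id and allow_patterns are placeholders, not the real paths.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<videomind-dataset-repo>",  # placeholder dataset repo ID
    repo_type="dataset",
    allow_patterns=["qvhighlights/*"],         # one 'Processed' folder, e.g. QVHighlights
    local_dir="data",
)
```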
The lists of source datasets are shown below. See our dataset repo for more details.

Training data (Grounder):
| Dataset | Source | Processed (Recommended) |
|---|---|---|
| QVHighlights | Link | qvhighlights |
| DiDeMo | Link | didemo |
| TACoS | Link | tacos |
| QuerYD | Link | queryd |
| HiREST (Grounding) | Link | hirest |
| HiREST (Step Captioning) | Link | hirest |
| CosMo-Cap | Link | cosmo_cap |
| InternVid-VTime | Link | internvid_vtime |
Training data (Verifier):

| Dataset | Source | Processed (Recommended) |
|---|---|---|
| QVHighlights-Verify | Link | verifying, qvhighlights |
| DiDeMo-Verify | Link | verifying, didemo |
| TACoS-Verify | Link | verifying, tacos |
Training data (Planner):

| Dataset | Source | Processed (Recommended) |
|---|---|---|
| NExT-QA-Plan | Link | planning, nextqa |
| QVHighlights-Plan | Link | planning, qvhighlights |
Benchmarks for evaluation:

| Dataset | Task | Source | Processed (Recommended) |
|---|---|---|---|
| CG-Bench | Grounded VideoQA | Link | cgbench |
| ReXTime | Grounded VideoQA | Link | rextime, activitynet, qvhighlights |
| NExT-GQA | Grounded VideoQA | Link | nextgqa |
| Charades-STA | VTG | Link | charades_sta |
| ActivityNet-Captions | VTG | Link | activitynet_captions, activitynet |
| QVHighlights | VTG | Link | qvhighlights |
| TACoS | VTG | Link | tacos |
| Ego4D-NLQ | VTG | Link | ego4d_nlq, ego4d |
| ActivityNet-RTL | VTG | Link | activitynet_rtl, activitynet |
| Video-MME | General VideoQA | Link | videomme |
| MLVU | General VideoQA | Link | mlvu |
| LVBench | General VideoQA | Link | lvbench |
| MVBench | General VideoQA | Link | mvbench |
| LongVideoBench | General VideoQA | Link | longvideobench |
The following datasets are not used in the final version of our project (some were partially used during early exploration), but we still share them to facilitate future research.
| Dataset | Task | Training | Evaluation | Source | Processed (Recommended) |
|---|---|---|---|---|---|
| QaEgo4D | Grounded VideoQA | โ | โ | Link | qa_ego4d, ego4d |
| Ego4D-NaQ | VTG | โ | โ | Link | ego4d_naq, ego4d |
| Ego-TimeQA | VTG | โ | โ | Link | ego_timeqa, ego4d |
| Vid-Morp | VTG | โ | โ | Link | vid_morp |
| VideoXum | VTG (originally VS) | โ | โ | Link | videoxum |
| YouCook2 | VTG (originally DVC) | โ | โ | Link | youcook2 |
| STAR | VideoQA | โ | โ | Link | star, charades_sta |
| COIN | - | - | - | Link | coin |
Notes:
- For some datasets (e.g., ReXTime), the annotations and videos are stored in different folders. All the directories listed in `Processed` need to be downloaded.
- Use the following commands to concatenate and extract video tar splits (e.g., videos.tar.gz.00, videos_3fps_480_noaudio.tar.gz.00).
```bash
# videos.tar.gz.00, videos.tar.gz.01
cat videos.tar.gz.* | tar -zxvf -

# videos_3fps_480_noaudio.tar.gz.00, videos_3fps_480_noaudio.tar.gz.01
cat videos_3fps_480_noaudio.tar.gz.* | tar -zxvf -
```
Our codebase supports training and evaluation on 27 video datasets and benchmarks, with the following features:
- Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
- Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
- Customizing the base LLM and conversation templates
- Monitoring the training process via Tensorboard / Wandb
- Group sampling for mixed-dataset training (see the sketch after this list)
- Multi-process / multi-device evaluation on public benchmarks
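As a rough illustration of the group sampling feature above (not the sampler actually implemented in this repo; the class and argument names are made up for the example), a sampler over a `ConcatDataset` can restrict every group of indices to a single source dataset so that each micro-batch stays homogeneous:

```python
# Illustrative sketch of group sampling over a ConcatDataset (not the actual
# implementation in this codebase). With group_size equal to the per-device
# batch size, every batch is drawn from a single source dataset.
import random
from torch.utils.data import ConcatDataset, Sampler

class GroupSampler(Sampler):
    def __init__(self, dataset: ConcatDataset, group_size: int, seed: int = 42):
        self.group_size = group_size
        self.rng = random.Random(seed)
        # Split the global index range into one chunk per source dataset.
        self.chunks, start = [], 0
        for end in dataset.cumulative_sizes:
            self.chunks.append(list(range(start, end)))
            start = end

    def __iter__(self):
        groups = []
        for chunk in self.chunks:
            indices = self.rng.sample(chunk, len(chunk))  # shuffled copy
            # Drop the tail so every group is full and single-source.
            for i in range(0, len(indices) - self.group_size + 1, self.group_size):
                groups.append(indices[i:i + self.group_size])
        self.rng.shuffle(groups)  # mix datasets at the group level
        return iter([idx for group in groups for idx in group])

    def __len__(self):
        return sum(len(c) // self.group_size * self.group_size for c in self.chunks)
```

Passing such a sampler to a `DataLoader` (with `batch_size=group_size` and `shuffle=False`) keeps prompts and collation consistent within each batch while still mixing datasets across batches.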
See TRAIN.md for a quick start guide.
See EVAL.md for details about evaluating VideoMind on public benchmarks.
Please cite our paper if you find this project helpful.
```bibtex
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
```
