Abstract
Video-Holmes is a benchmark designed to evaluate the complex video reasoning capabilities of multimodal large language models (MLLMs).
Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films (1 to 5 minutes long), spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments.
⭐ Key Aspects of Video-Holmes:
- One-Click Evaluation: Videos, audio, questions, and evaluation code are packaged on GitHub and Hugging Face (see the download sketch after this list).
- High Reasoning Demand: Significant performance gap between reasoning models and non-reasoning models.
- Reasoning Process Analysis: Clearly visualizes the reasons behind correct and incorrect model responses.
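
As a minimal illustration of the one-click setup, the sketch below fetches the benchmark files from Hugging Face with `huggingface_hub`. The repository ID `TencentARC/Video-Holmes` and the dataset layout are assumptions on our part; check the GitHub and Hugging Face pages for the actual names.

```python
# Minimal sketch: download the Video-Holmes files from Hugging Face.
# ASSUMPTION: the dataset repo id is "TencentARC/Video-Holmes"; verify it
# on the project's GitHub / Hugging Face pages before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TencentARC/Video-Holmes",  # assumed repo id
    repo_type="dataset",
)
print(f"Benchmark files downloaded to: {local_dir}")
```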
Leaderboard
Leaderboard of Video-Holmes, where SR means Social Reasoning; IMC means Intention and Motive Chaining; TCI means Temporal Causal Inference; TA means Timeline Analysis; MHR means Multimodal Hint Reasoning; PAR means Physical Anomaly Reasoning; and CTI means Core Theme Inference.
| # | Model | Audio | SR | IMC | TCI | TA | MHR | PAR | CTI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro 🥇 | ✅ | 54.8 | 54.3 | 53.8 | 56.0 | 48.8 | 46.4 | 44.8 | 51.3 |
| 2 | Gemini-2.0-Flash-Thinking 🥈 | ✅ | 56.5 | 54.2 | 43.4 | 44.5 | 43.9 | 55.1 | 50.1 | 49.5 |
| 3 | Gemini-1.5-Pro 🥉 | ✅ | 59.6 | 54.7 | 37.4 | 33.5 | 40.4 | 47.4 | 44.4 | 45.7 |
| 4 | Gemini-2.5-Pro | ❌ | 46.6 | 49.3 | 46.9 | 53.0 | 40.1 | 44.3 | 37.4 | 45.0 |
| 5 | Gemini-2.0-Flash-Thinking | ❌ | 43.4 | 46.9 | 43.1 | 51.0 | 37.9 | 43.6 | 39.3 | 43.1 |
| 6 | GPT-4o | ❌ | 50.0 | 49.6 | 38.8 | 30.0 | 44.0 | 39.2 | 37.0 | 42.0 |
| 7 | Gemini-1.5-Pro | ❌ | 52.1 | 48.2 | 34.4 | 26.0 | 39.2 | 46.4 | 38.9 | 41.2 |
| 8 | Claude 3.5 Sonnet | ❌ | 45.9 | 48.2 | 33.7 | 39.5 | 40.7 | 39.7 | 38.1 | 41.0 |
| 9 | Video-RTS | ❌ | 51.0 | 44.6 | 33.3 | 43.0 | 37.1 | 35.6 | 38.2 | 40.0 |
| 10 | Claude 3.7 Sonnet | ❌ | 48.6 | 43.5 | 30.8 | 41.0 | 39.8 | 36.6 | 33.7 | 39.3 |
| 11 | Qwen2.5-VL-32B | ❌ | 43.2 | 44.2 | 31.5 | 51.0 | 36.4 | 31.4 | 32.2 | 38.4 |
| 12 | Video-R1 | ❌ | 48.6 | 41.7 | 28.9 | 34.5 | 31.0 | 33.5 | 35.9 | 36.5 |
| 13 | SpaceR | ❌ | 48.2 | 39.4 | 26.0 | 33.0 | 28.9 | 35.1 | 35.6 | 35.2 |
| 14 | SEED-Bench-R1 | ❌ | 42.8 | 35.1 | 25.6 | 40.5 | 29.2 | 29.9 | 32.6 | 33.5 |
| 15 | VideoChat-R1 | ❌ | 42.1 | 38.8 | 24.5 | 39.5 | 29.5 | 27.8 | 29.3 | 33.0 |
| 16 | InternVL3-8B | ❌ | 29.5 | 40.7 | 37.9 | 35.1 | 24.6 | 38.9 | 24.1 | 32.3 |
| 17 | Gemini-2.0-Flash | ❌ | 41.8 | 33.7 | 23.1 | 20.5 | 30.1 | 26.8 | 33.7 | 30.6 |
| 18 | OpenAI o4-mini | ❌ | 36.3 | 31.2 | 20.5 | 34.0 | 30.1 | 30.9 | 27.4 | 29.9 |
| 19 | Qwen2.5-VL-7B | ❌ | 38.4 | 34.8 | 17.6 | 30.0 | 27.1 | 18.6 | 25.2 | 27.8 |
| 20 | Qwen2.5-Omni-7B | ✅ | 38.4 | 30.8 | 22.3 | 12.0 | 21.1 | 21.1 | 20.7 | 24.4 |
| 21 | InternVL2.5-8B | ❌ | 27.8 | 32.1 | 21.2 | 7.6 | 25.4 | 23.6 | 22.4 | 23.6 |
| 22 | Qwen2.5-Omni-7B | ❌ | 27.1 | 19.9 | 13.9 | 7.5 | 14.8 | 14.9 | 13.7 | 16.4 |
Construction and Evaluation Pipeline
We select 270 high-quality suspense short films for human annotation. Next, we design seven challenging tasks and employ DeepSeek to generate questions. Finally, we evaluate state-of-the-art MLLMs and, optionally, use DeepSeek to analyze their responses.
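
To make the evaluation step concrete, here is a hedged scoring sketch, assuming each question record carries a task tag (SR, IMC, ...), a ground-truth option, and a model answer. The field names are hypothetical, and the packaged evaluation code is the reference; whether the leaderboard's Avg is a simple or question-weighted mean of the task scores is not stated here, so the overall figure below is illustrative.

```python
# Hedged sketch of per-task accuracy scoring for Video-Holmes-style results.
# ASSUMPTION: field names "task", "answer", "prediction" are hypothetical;
# the official evaluation code on GitHub is the authoritative implementation.
from collections import defaultdict

def score_by_task(records):
    """Return {task: accuracy in %} plus a question-weighted overall 'Avg'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["task"]] += 1
    scores = {t: 100.0 * correct[t] / total[t] for t in total}
    scores["Avg"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores

# Toy usage with two dummy records:
demo = [
    {"task": "SR", "answer": "A", "prediction": "A"},
    {"task": "TCI", "answer": "B", "prediction": "C"},
]
print(score_by_task(demo))  # {'SR': 100.0, 'TCI': 0.0, 'Avg': 50.0}
```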
Question Types
Existing benchmarks primarily involve clue-given questions, where models depend on explicitly provided clues to derive answers. In contrast, Video-Holmes adopts an active seeking paradigm, requiring models to actively locate and connect multiple relevant visual clues scattered across different video segments.
Examples
Examples from Video-Holmes of questions, explanations, model answers, and reasoning-process analyses.
Citation
@article{cheng2025video,
title={Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?},
author={Cheng, Junhao and Ge, Yuying and Wang, Teng and Ge, Yixiao and Liao, Jing and Shan, Ying},
journal={arXiv preprint arXiv:2505.21374},
year={2025}
}