Abstract
Video-Holmes is a benchmark designed to evaluate the complex video reasoning capabilities of multimodal large language models (MLLMs).
Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films (1 to 5 minutes long), spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments.
⭐ Key Aspects of Video-Holmes:
- One-Click Evaluation: Videos, audio, questions, and evaluation code are packaged on GitHub and Hugging Face (see the download sketch after this list).
- High Reasoning Demand: Significant performance gap between reasoning models and non-reasoning models.
- Reasoning Process Analysis: Clearly visualizes the reasons behind correct and incorrect model responses.
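
As a minimal illustration of the one-click setup, the sketch below fetches the benchmark files from Hugging Face with `huggingface_hub`. The repository ID `TencentARC/Video-Holmes` and the dataset layout are assumptions on our part; check the GitHub and Hugging Face pages for the actual names.

```python
# Minimal sketch: download the Video-Holmes files from Hugging Face.
# ASSUMPTION: the dataset repo id is "TencentARC/Video-Holmes"; verify it
# on the project's GitHub / Hugging Face pages before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TencentARC/Video-Holmes",  # assumed repo id
    repo_type="dataset",
)
print(f"Benchmark files downloaded to: {local_dir}")
```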
Leaderboard
Leaderboard of Video-Holmes, where SR means Social Reasoning; IMC means Intention and Motive Chaining; TCI means Temporal Causal Inference; TA means Timeline Analysis; MHR means Multimodal Hint Reasoning; PAR means Physical Anomaly Reasoning; and CTI means Core Theme Inference.
| # | Model | Audio | SR | IMC | TCI | TA | MHR | PAR | CTI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro 🥇 | ✅ | 54.8 | 54.3 | 53.8 | 56.0 | 48.8 | 46.4 | 44.8 | 51.3 |
| 2 | Gemini-2.0-Flash-Thinking 🥈 | ✅ | 56.5 | 54.2 | 43.4 | 44.5 | 43.9 | 55.1 | 50.1 | 49.5 |
| 3 | Gemini-1.5-Pro 🥉 | ✅ | 59.6 | 54.7 | 37.4 | 33.5 | 40.4 | 47.4 | 44.4 | 45.7 |
| 4 | Gemini-2.5-Pro | ❌ | 46.6 | 49.3 | 46.9 | 53.0 | 40.1 | 44.3 | 37.4 | 45.0 |
| 5 | Gemini-2.0-Flash-Thinking | ❌ | 43.4 | 46.9 | 43.1 | 51.0 | 37.9 | 43.6 | 39.3 | 43.1 |
| 6 | GPT-4o | ❌ | 50.0 | 49.6 | 38.8 | 30.0 | 44.0 | 39.2 | 37.0 | 42.0 |
| 7 | Gemini-1.5-Pro | ❌ | 52.1 | 48.2 | 34.4 | 26.0 | 39.2 | 46.4 | 38.9 | 41.2 |
| 8 | Claude 3.5 Sonnet | ❌ | 45.9 | 48.2 | 33.7 | 39.5 | 40.7 | 39.7 | 38.1 | 41.0 |
| 9 | Video-RTS | ❌ | 51.0 | 44.6 | 33.3 | 43.0 | 37.1 | 35.6 | 38.2 | 40.0 |
| 10 | Claude 3.7 Sonnet | ❌ | 48.6 | 43.5 | 30.8 | 41.0 | 39.8 | 36.6 | 33.7 | 39.3 |
| 11 | Qwen2.5-VL-32B | ❌ | 43.2 | 44.2 | 31.5 | 51.0 | 36.4 | 31.4 | 32.2 | 38.4 |
| 12 | Video-R1 | ❌ | 48.6 | 41.7 | 28.9 | 34.5 | 31.0 | 33.5 | 35.9 | 36.5 |
| 13 | SpaceR | ❌ | 48.2 | 39.4 | 26.0 | 33.0 | 28.9 | 35.1 | 35.6 | 35.2 |
| 14 | SEED-Bench-R1 | ❌ | 42.8 | 35.1 | 25.6 | 40.5 | 29.2 | 29.9 | 32.6 | 33.5 |
| 15 | VideoChat-R1 | ❌ | 42.1 | 38.8 | 24.5 | 39.5 | 29.5 | 27.8 | 29.3 | 33.0 |
| 16 | InternVL3-8B | ❌ | 29.5 | 40.7 | 37.9 | 35.1 | 24.6 | 38.9 | 24.1 | 32.3 |
| 17 | Gemini-2.0-Flash | ❌ | 41.8 | 33.7 | 23.1 | 20.5 | 30.1 | 26.8 | 33.7 | 30.6 |
| 18 | OpenAI o4-mini | ❌ | 36.3 | 31.2 | 20.5 | 34.0 | 30.1 | 30.9 | 27.4 | 29.9 |
| 19 | Qwen2.5-VL-7B | ❌ | 38.4 | 34.8 | 17.6 | 30.0 | 27.1 | 18.6 | 25.2 | 27.8 |
| 20 | Qwen2.5-Omni-7B | ✅ | 38.4 | 30.8 | 22.3 | 12.0 | 21.1 | 21.1 | 20.7 | 24.4 |
| 21 | InternVL2.5-8B | ❌ | 27.8 | 32.1 | 21.2 | 7.6 | 25.4 | 23.6 | 22.4 | 23.6 |
| 22 | Qwen2.5-Omni-7B | ❌ | 27.1 | 19.9 | 13.9 | 7.5 | 14.8 | 14.9 | 13.7 | 16.4 |
Construction and Evaluation Pipeline
We select 270 high-quality suspense short films for human annotation. Next, we design seven challenging tasks and employ DeepSeek to generate questions. Finally, we evaluate state-of-the-art MLLMs and, optionally, use DeepSeek to analyze their responses.
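
To make the evaluation step concrete, here is a hedged scoring sketch, assuming each question record carries a task tag (SR, IMC, ...), a ground-truth option, and a model answer. The field names are hypothetical, and the packaged evaluation code is the reference; whether the leaderboard's Avg is a simple or question-weighted mean of the task scores is not stated here, so the overall figure below is illustrative.

```python
# Hedged sketch of per-task accuracy scoring for Video-Holmes-style results.
# ASSUMPTION: field names "task", "answer", "prediction" are hypothetical;
# the official evaluation code on GitHub is the authoritative implementation.
from collections import defaultdict

def score_by_task(records):
    """Return {task: accuracy in %} plus a question-weighted overall 'Avg'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["task"]] += 1
    scores = {t: 100.0 * correct[t] / total[t] for t in total}
    scores["Avg"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores

# Toy usage with two dummy records:
demo = [
    {"task": "SR", "answer": "A", "prediction": "A"},
    {"task": "TCI", "answer": "B", "prediction": "C"},
]
print(score_by_task(demo))  # {'SR': 100.0, 'TCI': 0.0, 'Avg': 50.0}
```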
Question Types
Existing benchmarks primarily involve clue-given questions, where models depend on explicitly provided clues to derive answers. In contrast, Video-Holmes adopts an active seeking paradigm, requiring models to actively locate and connect multiple relevant visual clues scattered across different video segments.
Examples
Examples from Video-Holmes of questions, explanations, model answers, and reasoning-process analyses.
Citation
@article{cheng2025video,
title={Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?},
author={Cheng, Junhao and Ge, Yuying and Wang, Teng and Ge, Yixiao and Liao, Jing and Shan, Ying},
journal={arXiv preprint arXiv:2505.21374},
year={2025}
}