We introduce ShotBench, a comprehensive benchmark for evaluating VLMs’ understanding of cinematic language. It comprises over 3.5k expert-annotated QA pairs derived from images and video clips of over 200 critically acclaimed films (predominantly Oscar-nominated), covering eight distinct cinematography dimensions. This provides a rigorous new standard for assessing fine-grained visual comprehension in film.
We conducted an extensive evaluation of 24 leading VLMs, including prominent open-source and proprietary models, on ShotBench. Our results reveal a critical performance gap: even the most capable model, GPT-4o, achieves less than 60% average accuracy. This systematically quantifies the current limitations of VLMs in genuine cinematographic comprehension.
To address the identified limitations and facilitate future research, we constructed ShotQA, the first large-scale multimodal dataset for cinematography understanding, containing approximately 70k high-quality QA pairs. Leveraging ShotQA, we developed ShotVL, a novel VLM trained using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). ShotVL significantly surpasses all tested open-source and proprietary models, establishing a new state-of-the-art on ShotBench.
Here we show an overview of ShotBench. The benchmark covers eight core dimensions of cinematography: shot size, framing, camera angle, lens size, lighting type, lighting condition, composition, and camera movement.
Here we show some samples from the ShotBench and ShotQA datasets; a minimal loading sketch follows the camera-movement samples below. Please download the datasets from Hugging Face to obtain the full data.
Push in
Zoom in
Pull out
Zoom out
Dolly zoom
Dolly zoom
Arc
Arc
Tilt down
Tilt up
Static shot
Static shot
Trucking left
Trucking right
Boom down
Boom up
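The snippet below is a minimal loading sketch, assuming the data is hosted on Hugging Face and accessed with the `datasets` library. The repository id, split name, and field names are placeholders rather than the official identifiers; please check the Hugging Face page for the exact values.

```python
# Minimal loading sketch (assumptions: repo id, split name, and field names
# are placeholders -- replace them with the values from the Hugging Face page).
from datasets import load_dataset

shotbench = load_dataset("ORG_NAME/ShotBench")  # placeholder repo id
sample = shotbench["test"][0]                   # assumed split name
print(sample.keys())                            # e.g. question / options / answer / dimension (assumed)
```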
Evaluation
We report the evaluation results for 24 VLMs and ShotVL below. ShotVL sets a new overall SOTA across all evaluated models.
Abbreviations adopted: SS = Shot Size, SF = Shot Framing, CA = Camera Angle, LS = Lens Size, LT = Lighting Type, LC = Lighting Conditions, SC = Shot Composition, CM = Camera Movement.
Underline marks the previous best performance in each group.
Our ShotVL models establish a new SOTA and provide a strong baseline for future research.
| Models | SS | SF | CA | LS | LT | LC | SC | CM | Avg |
|---|---|---|---|---|---|---|---|---|---|
| **Open-Sourced VLMs** | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 54.6 | 56.6 | 43.1 | 36.6 | 59.3 | 45.1 | 41.5 | 31.9 | 46.1 |
| Qwen2.5-VL-7B-Instruct | 69.1 | 73.5 | 53.2 | 47.0 | 60.5 | 47.4 | 49.9 | 30.2 | 53.8 |
| LLaVA-NeXT-Video-7B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 31.3 | 34.4 |
| LLaVA-Video-7B-Qwen2 | 56.9 | 65.4 | 45.1 | 36.0 | 63.5 | 45.4 | 37.4 | 35.3 | 48.1 |
| LLaVA-Onevision-Qwen2-7B-Ov-Chat | 58.4 | 71.0 | 52.3 | 38.7 | 59.5 | 44.9 | 50.9 | 39.7 | 51.9 |
| InternVL2.5-8B | 56.3 | 70.3 | 50.8 | 41.1 | 60.2 | 45.1 | 50.1 | 33.6 | 50.9 |
| InternVL3-2B | 56.3 | 56.0 | 44.4 | 34.6 | 56.8 | 44.6 | 43.0 | 38.1 | 46.7 |
| InternVL3-8B | 62.1 | 65.8 | 46.8 | 42.9 | 58.0 | 44.3 | 46.8 | 44.2 | 51.4 |
| InternVL3-14B | 59.6 | 82.2 | 55.4 | 40.7 | 61.7 | 44.6 | 51.1 | 38.2 | 54.2 |
| Internlm-xcomposer2d5-7B | 51.1 | 71.0 | 39.8 | 32.7 | 59.3 | 35.7 | 35.7 | 38.8 | 45.5 |
| Ovis2-8B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 35.3 | 34.9 |
| VILA1.5-3B | 33.4 | 44.9 | 32.1 | 28.6 | 50.6 | 35.7 | 28.4 | 21.5 | 34.4 |
| VILA1.5-8B | 40.6 | 44.5 | 39.1 | 29.7 | 48.9 | 32.9 | 34.4 | 36.9 | 38.4 |
| VILA1.5-13B | 36.7 | 54.6 | 40.7 | 34.8 | 52.8 | 35.4 | 34.2 | 31.3 | 40.1 |
| Instructblip-vicuna-7B | 27.0 | 27.9 | 34.5 | 29.4 | 44.4 | 29.7 | 27.1 | 25.0 | 30.6 |
| Instructblip-vicuna-13B | 26.8 | 29.2 | 27.9 | 28.0 | 39.0 | 24.0 | 27.1 | 22.0 | 28.0 |
| InternVL2.5-38B | 67.8 | 85.4 | 55.4 | 41.7 | 61.7 | 48.9 | 52.4 | 44.0 | 57.2 |
| InternVL3-38B | 68.0 | 84.0 | 51.9 | 43.6 | 64.4 | 46.9 | 54.7 | 44.6 | 57.3 |
| Qwen2.5-VL-32B-Instruct | 62.3 | 76.6 | 51.0 | 48.3 | 61.7 | 44.0 | 52.2 | 43.8 | 55.0 |
| Qwen2.5-VL-72B-Instruct | 75.1 | 82.9 | 56.7 | 46.8 | 59.0 | 49.4 | 54.1 | 48.9 | 59.1 |
| InternVL3-78B | 69.7 | 80.0 | 54.5 | 44.0 | 65.5 | 47.4 | 51.8 | 44.4 | 57.2 |
| **Proprietary VLMs** | | | | | | | | | |
| Gemini-2.0-flash | 48.9 | 75.5 | 44.6 | 31.9 | 62.2 | 48.9 | 52.4 | 47.4 | 51.5 |
| Gemini-2.5-flash-preview-04-17 | 57.7 | 82.9 | 51.4 | 43.8 | 65.2 | 45.7 | 45.9 | 43.5 | 54.5 |
| GPT-4o | 69.3 | 83.1 | 58.2 | 48.9 | 63.2 | 48.0 | 55.2 | 48.3 | 59.3 |
| **Ours** | | | | | | | | | |
| ShotVL-3B | 77.9 | 85.6 | 68.8 | 59.3 | 65.7 | 53.1 | 57.4 | 51.7 | 65.1 |
| ShotVL-7B | 81.2 | 90.1 | 78.0 | 68.5 | 70.1 | 64.3 | 45.7 | 62.9 | 70.1 |
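For reference, the per-dimension scores above are accuracies on multiple-choice QA. The snippet below is a minimal aggregation sketch, assuming each evaluation record carries a dimension label, a ground-truth answer, and a model prediction; the field names and the unweighted averaging over the eight dimensions are assumptions, not the official evaluation protocol.

```python
# Minimal sketch of per-dimension accuracy aggregation for one model.
# Assumptions: record fields ("dimension", "answer", "prediction") and an
# unweighted mean over dimensions for the Avg column.
from collections import defaultdict

def per_dimension_accuracy(records):
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["prediction"] == r["answer"])
    acc = {dim: 100.0 * correct[dim] / total[dim] for dim in total}
    acc["Avg"] = sum(acc.values()) / len(acc)  # unweighted mean over dimensions
    return acc

# Toy usage with hypothetical records:
records = [
    {"dimension": "Shot Size", "answer": "Close-up", "prediction": "Close-up"},
    {"dimension": "Camera Angle", "answer": "Low angle", "prediction": "Eye level"},
]
print(per_dimension_accuracy(records))  # {'Shot Size': 100.0, 'Camera Angle': 0.0, 'Avg': 50.0}
```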
Analysis and Findings
(1) Approximately half of the evaluated models attain an overall accuracy below 50%. Even the leading model, GPT-4o, fails to reach 60% accuracy, underscoring the significant gap between current VLMs and genuine cinematography understanding.
(2) The overall performance differences between open-source and proprietary models are marginal.
(3) Within each series, larger models generally achieve higher accuracy.
(4) Stronger models perform uniformly well across dimensions, without specific dimensional weaknesses.
(5) Fine-tuning with SFT and GRPO on the proposed ShotQA dataset effectively enhances the model's capability in cinematography understanding. Notably, the sequential training strategy of applying GRPO after SFT yields the best performance.
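As a point of reference for the GRPO stage, the snippet below sketches the group-relative advantage at the core of GRPO in general: several answers are sampled per question, rewarded (e.g. 1 for the correct option, 0 otherwise), and each reward is normalized by the group's mean and standard deviation. This is a simplified illustration of the general technique; it omits the clipped policy-ratio objective and KL penalty and is not the authors' exact training code.

```python
# Simplified sketch of GRPO's group-relative advantage (general technique,
# not the ShotVL training code): rewards for G sampled answers to the same
# question are normalized by the group mean and standard deviation.
def group_relative_advantages(rewards, eps=1e-6):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled answers to one multiple-choice question,
# reward 1.0 if the selected option is correct, 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approximately [1.0, -1.0, -1.0, 1.0]
```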
Overall performance comparison of the InternVL3, Qwen2.5-VL, and VILA-1.5 model families, highlighting variations by model size.
Performance of six Vision-Language Models (VLMs) across cinematographic dimensions; stronger models show uniformly high scores without obvious weak spots.
BibTeX
@misc{liu2025shotbench,
  title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
  author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
  year={2025},
  eprint={2506.21356},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.21356},
}