Benchmark Curation
The Benchmark Is Built from Human Annotation
Overview of the annotation pipeline for TemporalBench. In step 1, we first collect high-quality captions for the videos from qualified AMT annotators and then refine them. In step 2, we leverage existing LLMs to generate negative captions by replacing selected words and reordering the sequence of actions, before filtering them ourselves.
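Below is a minimal sketch of step 2 as described above, assuming a hypothetical call_llm helper and illustrative prompt wording (not the exact prompts or code used for TemporalBench): one prompt substitutes a few words, the other reorders the actions, and the candidates are filtered manually afterwards.

# Illustrative sketch of the two negative-caption strategies described above.
# `call_llm` is a hypothetical helper that sends a prompt to any chat LLM and
# returns its text response; the prompt wording is an assumption.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion request)."""
    raise NotImplementedError

def generate_negatives(caption: str) -> dict:
    """Generate candidate negative captions from one fine-grained positive caption."""
    word_sub_prompt = (
        "Rewrite the following video caption, changing only one or two words "
        "(an action, object, attribute, or count) so that it no longer matches "
        "the video:\n" + caption
    )
    reorder_prompt = (
        "Rewrite the following video caption so that the same actions appear "
        "in a different temporal order:\n" + caption
    )
    return {
        "word_substitution": call_llm(word_sub_prompt),
        "action_reordering": call_llm(reorder_prompt),
    }

# The generated candidates are then filtered manually (step 2) to keep only
# fluent negatives that genuinely contradict the video.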
TemporalBench Provides High-quality Negatives
Comparison of negative captions generated from the original captions versus from our detailed captions in TemporalBench. With fine-grained details, the negatives are more difficult and more temporally centric.
Experimental Results
Different Question Types
All current LMMs show a large gap to human performance. Visualization of binary accuracy for short video QA per (a) subset and (b) negative type. Human performance is much higher than that of GPT-4o, Qwen2-VL-72B, LLaVA-OneVision-72B, and Gemini-1.5-Pro.
More Frames Help, but Not Much
Model performance on TemporalBench with varying numbers of frames for short video understanding. With more frames, LMMs mostly perform better, but the improvement is limited.
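As a concrete illustration of the frame budgets being varied, here is a small uniform frame-index sampler of the kind commonly used to feed an LMM a fixed number of frames; this is an assumption about the setup, not the benchmark's released evaluation code.

import numpy as np

def sample_frame_indices(num_total_frames: int, num_frames: int) -> list:
    """Evenly spaced frame indices (including first and last frame) for a given budget."""
    if num_total_frames <= num_frames:
        return list(range(num_total_frames))
    indices = np.linspace(0, num_total_frames - 1, num_frames)
    return [int(round(i)) for i in indices]

# Example: sampling an 8-frame budget from a 300-frame clip.
print(sample_frame_indices(300, 8))  # [0, 43, 85, 128, 171, 214, 256, 299]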
A Pitfall in Multi-choice Question Answering
While developing our benchmark, we noticed another previously ignored but critical pitfall for multi-choice QA. Specifically, if every negative answer choice is generated by changing a small part of the correct answer, the LLM can detect those changes, find a centralized description, and use that cue for its prediction. To study this, given a positive caption C and its associated negative caption N_1(C), we intentionally derive further negatives from N_1(C) (instead of from C), obtaining N_1(N_1(C)) and N_2(N_1(C)); this yields [C, N_1(C), N_1(N_1(C)), N_2(N_1(C))] as the options, so that N_1(C) becomes the centralized description (see the negative caption generation figure above). Surprisingly, we find that 66.4% of text-only GPT-4o's predictions correspond to N_1(C), while only 6.4% of its predictions correspond to C. Our findings also align with human behavior analysis from psychology (Furman et al., 2008), where humans can achieve better-than-random-chance performance on multi-choice QAs using similar cues.
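The toy example below (not the benchmark's code) makes the cue concrete: when three of the four options are small edits of N_1(C), a simple word-overlap heuristic already singles out N_1(C) without ever looking at the video. The captions and the overlap heuristic are illustrative assumptions.

import re

def tokens(s: str) -> set:
    return set(re.findall(r"[a-z]+", s.lower()))

def most_central(options: list) -> str:
    """Return the option sharing the most words with all the other options."""
    def score(o):
        return sum(len(tokens(o) & tokens(x)) for x in options if x is not o)
    return max(options, key=score)

# Hypothetical captions: C is correct; the other three are derived negatives.
C    = "The person picks up the cup, then stirs the coffee twice."
N1   = "The person picks up the cup, then stirs the tea twice."        # N_1(C)
N1N1 = "The person picks up the cup, then stirs the tea three times."  # N_1(N_1(C))
N2N1 = "The person puts down the cup, then stirs the tea twice."       # N_2(N_1(C))

print(most_central([C, N1, N1N1, N2N1]))  # prints the N_1(C) caption, not C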
Motivated by these findings, we propose to decompose a single multi-choice QA into multiple binary QAs. This eliminates the centralized option, since each question presents only two choices. As a result, given M negatives, the multiple binary QAs query a model M times, and the random-chance performance drops from 1 / (M+1) to (1/2)^M. Given that (1/2)^M < 1 / (M+1) for every M >= 2, multiple binary QA is a more difficult task than multi-choice QA.
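A quick check of the chance levels above, plus a minimal scoring sketch under the assumption implied by the (1/2)^M figure, namely that a clip counts as correct only when all M positive-vs-negative questions are answered correctly:

# Chance level of multi-choice QA (M negatives + 1 positive) vs. the binary
# decomposition scored as all-M-correct.
for M in range(1, 7):
    multi_choice_chance = 1 / (M + 1)
    multi_binary_chance = 0.5 ** M
    print(f"M={M}: multi-choice {multi_choice_chance:.3f} vs multiple-binary {multi_binary_chance:.3f}")

# Minimal scoring sketch: per_question_correct[i] says whether the model chose
# the positive caption over the i-th negative; the clip counts only if all are correct.
def multiple_binary_correct(per_question_correct: list) -> bool:
    return all(per_question_correct)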
GPT-4o Fails to Distinguish Basic Temporal Dynamics
Citation
@article{cai2024temporalbench,
title={TemporalBench: Towards Fine-grained Temporal Understanding for Multimodal Video Models},
author={Cai, Mu and Tan, Reuben and Zhang, Jianrui and Zou, Bocheng and Zhang, Kai and Yao, Feng and Zhu, Fangrui and Gu, Jing and Zhong, Yiwu and Shang, Yuzhang and Dou, Yao and Park, Jaden and Gao, Jianfeng and Lee, Yong Jae and Yang, Jianwei},
journal={arXiv preprint arXiv:2410.10818},
year={2024}
}