Benchmark Curation
The Benchmark Is Built from Human Annotation
Overview of the annotation pipeline for TemporalBench. In step 1, we first collect high-quality captions for the videos from qualified AMT annotators and then refine them. In step 2, we leverage existing LLMs to generate negative captions by replacing selected words and reordering the sequence of actions, before filtering them ourselves.
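Below is a minimal sketch of step 2 as described above, assuming a hypothetical call_llm helper and illustrative prompt wording (not the exact prompts or code used for TemporalBench): one prompt substitutes a few words, the other reorders the actions, and the candidates are filtered manually afterwards.

# Illustrative sketch of the two negative-caption strategies described above.
# `call_llm` is a hypothetical helper that sends a prompt to any chat LLM and
# returns its text response; the prompt wording is an assumption.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion request)."""
    raise NotImplementedError

def generate_negatives(caption: str) -> dict:
    """Generate candidate negative captions from one fine-grained positive caption."""
    word_sub_prompt = (
        "Rewrite the following video caption, changing only one or two words "
        "(an action, object, attribute, or count) so that it no longer matches "
        "the video:\n" + caption
    )
    reorder_prompt = (
        "Rewrite the following video caption so that the same actions appear "
        "in a different temporal order:\n" + caption
    )
    return {
        "word_substitution": call_llm(word_sub_prompt),
        "action_reordering": call_llm(reorder_prompt),
    }

# The generated candidates are then filtered manually (step 2) to keep only
# fluent negatives that genuinely contradict the video.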
TemporalBench Provides High-quality Negatives
Comparison of negative captions generated from the original captions versus from our detailed captions in TemporalBench. With fine-grained details, the negatives are more difficult and more temporally centric.
Experimental Results
Different Question Types
All current LMMs show a large gap to human performance. Visualization of binary accuracy for short video QA per (a) subset and (b) negative type. Human performance is much higher than that of GPT-4o, Qwen2-VL-72B, LLaVA-OneVision-72B, and Gemini-1.5-Pro.
More Frames Help, but Not Much
Model performance on TemporalBench with varying numbers of frames for short video understanding. With more frames, LMMs mostly perform better, but the improvement is limited.
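As a concrete illustration of the frame budgets being varied, here is a small uniform frame-index sampler of the kind commonly used to feed an LMM a fixed number of frames; this is an assumption about the setup, not the benchmark's released evaluation code.

import numpy as np

def sample_frame_indices(num_total_frames: int, num_frames: int) -> list:
    """Evenly spaced frame indices (including first and last frame) for a given budget."""
    if num_total_frames <= num_frames:
        return list(range(num_total_frames))
    indices = np.linspace(0, num_total_frames - 1, num_frames)
    return [int(round(i)) for i in indices]

# Example: sampling an 8-frame budget from a 300-frame clip.
print(sample_frame_indices(300, 8))  # [0, 43, 85, 128, 171, 214, 256, 299]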
A Pitfall in Multi-choice Question Answering
While developing our benchmark, we noticed another previously ignored but critical pitfall for multi-choice QA. Specifically, if every negative answer choice is generated by changing a small part of the correct answer, the LLM can detect those changes, find a centralized description, and use that cue for its prediction. To study this, given a positive caption C and its associated negative caption N_1(C), we intentionally derive further negatives from N_1(C) (instead of from C), obtaining N_1(N_1(C)) and N_2(N_1(C)); this yields [C, N_1(C), N_1(N_1(C)), N_2(N_1(C))] as the options, so that N_1(C) becomes the centralized description (see the negative caption generation figure above). Surprisingly, we find that 66.4% of text-only GPT-4o's predictions correspond to N_1(C), while only 6.4% of its predictions correspond to C. Our findings also align with human behavior analysis from psychology (Furman et al., 2008), where humans can achieve better-than-random-chance performance on multi-choice QAs using similar cues.
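The toy example below (not the benchmark's code) makes the cue concrete: when three of the four options are small edits of N_1(C), a simple word-overlap heuristic already singles out N_1(C) without ever looking at the video. The captions and the overlap heuristic are illustrative assumptions.

import re

def tokens(s: str) -> set:
    return set(re.findall(r"[a-z]+", s.lower()))

def most_central(options: list) -> str:
    """Return the option sharing the most words with all the other options."""
    def score(o):
        return sum(len(tokens(o) & tokens(x)) for x in options if x is not o)
    return max(options, key=score)

# Hypothetical captions: C is correct; the other three are derived negatives.
C    = "The person picks up the cup, then stirs the coffee twice."
N1   = "The person picks up the cup, then stirs the tea twice."        # N_1(C)
N1N1 = "The person picks up the cup, then stirs the tea three times."  # N_1(N_1(C))
N2N1 = "The person puts down the cup, then stirs the tea twice."       # N_2(N_1(C))

print(most_central([C, N1, N1N1, N2N1]))  # prints the N_1(C) caption, not C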
Motivated by these findings, we propose to decompose a single multi-choice QA into multiple binary QAs. This eliminates the centralized option, since each question presents only two choices. As a result, given M negatives, the multiple binary QAs query a model M times, and the random-chance performance drops from 1 / (M+1) to (1/2)^M. Given that (1/2)^M < 1 / (M+1) for every M >= 2, multiple binary QA is a more difficult task than multi-choice QA.
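A quick check of the chance levels above, plus a minimal scoring sketch under the assumption implied by the (1/2)^M figure, namely that a clip counts as correct only when all M positive-vs-negative questions are answered correctly:

# Chance level of multi-choice QA (M negatives + 1 positive) vs. the binary
# decomposition scored as all-M-correct.
for M in range(1, 7):
    multi_choice_chance = 1 / (M + 1)
    multi_binary_chance = 0.5 ** M
    print(f"M={M}: multi-choice {multi_choice_chance:.3f} vs multiple-binary {multi_binary_chance:.3f}")

# Minimal scoring sketch: per_question_correct[i] says whether the model chose
# the positive caption over the i-th negative; the clip counts only if all are correct.
def multiple_binary_correct(per_question_correct: list) -> bool:
    return all(per_question_correct)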
GPT-4o Fails to Distinguish Basic Temporal Dynamics
Citation
@article{cai2024temporalbench,
title={TemporalBench: Towards Fine-grained Temporal Understanding for Multimodal Video Models},
author={Cai, Mu and Tan, Reuben and Zhang, Jianrui and Zou, Bocheng and Zhang, Kai and Yao, Feng and Zhu, Fangrui and Gu, Jing and Zhong, Yiwu and Shang, Yuzhang and Dou, Yao and Park, Jaden and Gao, Jianfeng and Lee, Yong Jae and Yang, Jianwei},
journal={arXiv preprint arXiv:2410.10818},
year={2024}
}