T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
Kaiyue Sun1 Kaiyi Huang1 Xian Liu2 Yue Wu3 Zihan Xu1 Zhenguo Li3 Xihui Liu1
1 The University of Hong Kong 2 The Chinese University of Hong Kong 3 Huawei Noah's Ark Lab
[Paper] [Code] [Join our Leaderboard!]
T2V-CompBench Prompt Suite.
Overview:
✔️ T2V-CompBench: We conduct the first systematic study on compositional text-to-video generation and propose this benchmark.
✔️ 1400 Prompts: We analyze 1.67 million real-user prompts to extract high-frequency nouns, verbs, and adjectives, resulting in 1,400 prompts.
✔️ 7 Categories: We evaluate the compositionality of multiple objects with their attributes, quantities, and spatio-temporal dynamics, covering 7 categories (a loading sketch follows this list).
✔️ Evaluation Metrics: We design MLLM-based, detection-based, and tracking-based evaluation metrics for compositional T2V generation, all validated by human evaluations.
✔️ Valuable Insights: We benchmark 20+ text-to-video generation models and provide insightful analysis of current models' abilities, highlighting the significant challenges of compositional T2V generation.
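
As a rough illustration of how a per-category prompt suite can be consumed, the sketch below iterates over one plain-text file per category. The directory layout, file names, and category identifiers are assumptions made for illustration, not the repository's actual structure.

```python
# Minimal sketch: load a prompt suite organized as one text file per category.
# The layout "prompts/<category>.txt" (one prompt per line) is an assumed
# convention for illustration only.
from pathlib import Path

CATEGORIES = [
    "consistent_attribute_binding",
    "dynamic_attribute_binding",
    "spatial_relationships",
    "motion_binding",
    "action_binding",
    "object_interactions",
    "generative_numeracy",
]

def load_prompt_suite(root: str = "prompts") -> dict[str, list[str]]:
    """Return {category: [prompt, ...]} for every category file found."""
    suite = {}
    for category in CATEGORIES:
        path = Path(root) / f"{category}.txt"
        if path.exists():
            lines = path.read_text(encoding="utf-8").splitlines()
            suite[category] = [line.strip() for line in lines if line.strip()]
    return suite

if __name__ == "__main__":
    for category, prompts in load_prompt_suite().items():
        print(f"{category}: {len(prompts)} prompts")
```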
Introduction
Evaluation Metrics
MLLM-based evaluation metrics for consistent attribute binding, dynamic attribute binding, action binding, and object interactions.
Detection-based evaluation metrics for spatial relationships and generative numeracy.
Tracking-based evaluation metrics for motion binding.
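
To make the tracking-based idea concrete, here is a minimal sketch that scores whether a tracked object moves in the direction the prompt specifies, using the cosine between its net displacement and a unit direction vector. It is an illustrative simplification under assumed inputs (per-frame centroids from any off-the-shelf tracker), not T2V-CompBench's exact motion-binding metric.

```python
# Sketch of a tracking-based motion-binding check (illustrative only).
import numpy as np

DIRECTIONS = {
    "leftwards": np.array([-1.0, 0.0]),
    "rightwards": np.array([1.0, 0.0]),
    "upwards": np.array([0.0, -1.0]),   # image y-axis points downwards
    "downwards": np.array([0.0, 1.0]),
}

def motion_direction_score(centroids: np.ndarray, direction: str) -> float:
    """centroids: (T, 2) array of (x, y) object centers over T frames.
    Returns the cosine similarity (clipped to [0, 1]) between the object's
    net displacement and the prompted direction; 0 if it barely moves."""
    displacement = centroids[-1] - centroids[0]
    norm = np.linalg.norm(displacement)
    if norm < 1e-6:
        return 0.0
    return max(0.0, float(np.dot(displacement / norm, DIRECTIONS[direction])))

# Example: an object drifting to the right across 16 frames.
track = np.stack([np.array([10.0 + 4 * t, 60.0]) for t in range(16)])
print(motion_direction_score(track, "rightwards"))  # -> 1.0
```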
Evaluation Results
Benchmarking T2V Models with a radar chart.
T2V-CompBench evaluation results for 23 T2V generation models (17 open-source models and 6 commercial models).
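
A radar chart of this kind can be drawn for any model's per-category scores with standard matplotlib; the sketch below uses placeholder numbers and abbreviated category labels for illustration only, not actual benchmark results.

```python
# Sketch: per-category radar chart for one model (placeholder scores only).
import numpy as np
import matplotlib.pyplot as plt

categories = [
    "Consist-attr", "Dynamic-attr", "Spatial", "Motion",
    "Action", "Interaction", "Numeracy",
]
scores = [0.6, 0.3, 0.5, 0.4, 0.55, 0.5, 0.35]  # placeholder values

# Close the polygon by repeating the first point.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 1)
ax.set_title("Example radar chart (placeholder scores)")
plt.show()
```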
Bibtex



