
Evaluation Tasks

Overview of evaluation tasks. FAVOR-Bench comprises close-ended and open-ended evaluations. The close-ended evaluation consists of six tasks, each targeting a different aspect of fine-grained motion understanding. The open-ended evaluation comprises a GPT-assisted evaluation and a novel LLM-free framework. In the GPT-assisted evaluation, model responses are compared directly with the manual captions; the LLM-free framework parses structured motion elements from responses and compares them with the structured annotations.
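For illustration, a minimal sketch of the LLM-free comparison step is given below. The simplified (subject, action) representation, the string-matching extraction, and the set-level F1 scoring are assumptions made for this example, not the exact FAVOR-Bench procedure.

# Hypothetical sketch of an LLM-free comparison step: parse motion elements
# from a model response by string matching against a motion vocabulary, then
# score the overlap with the structured annotations. The real FAVOR-Bench
# framework may use a different representation and matching procedure.
from typing import List, Tuple

Motion = Tuple[str, str]  # simplified (subject, action) pair


def extract_motions(response: str, vocabulary: List[Motion]) -> List[Motion]:
    """Keep vocabulary motions whose subject and action both appear in the response."""
    text = response.lower()
    return [(s, a) for (s, a) in vocabulary if s.lower() in text and a.lower() in text]


def motion_f1(predicted: List[Motion], annotated: List[Motion]) -> float:
    """Set-level F1 between parsed motions and the structured annotations."""
    pred, gold = set(predicted), set(annotated)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    vocabulary = [("man", "waves his hand"), ("man", "runs"),
                  ("dog", "jumps over the fence"), ("dog", "sits")]
    annotations = [("man", "waves his hand"), ("dog", "jumps over the fence")]
    response = "A man waves his hand while a dog sits nearby."
    parsed = extract_motions(response, vocabulary)
    print(parsed)                          # [('man', 'waves his hand'), ('dog', 'sits')]
    print(motion_f1(parsed, annotations))  # 0.5

In this toy case, the response covers one of the two annotated motions and adds one unannotated motion, so both precision and recall are 0.5 and the F1 score is 0.5.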

Statistics

Data statistics of FAVOR-Bench. Left: Task type distribution across the close-ended and open-ended evaluations in FAVOR-Bench. Middle: Distribution of the number of motions per video. Right: Word cloud of the motion vocabulary in FAVOR-Bench.

More data statistics of FAVOR-Bench. Left: Index distribution of correct answers for the close-ended tasks; for example, "(1)" indicates that the correct option is ranked first. Middle: Video duration distribution of FAVOR-Bench. Right: Distribution of the number of questions per video in FAVOR-Bench.

Benchmark Comparison


Comparison of FAVOR-Bench with existing video understanding benchmarks. #Videos and #Close-Ended QA refer to the number of videos and close-ended question-answer pairs, respectively. FAVOR-Bench covers a wide range of video types (Third-Person, Ego-Centric, Simulation) while focusing on fine-grained motion understanding. Moreover, FAVOR-Bench provides comprehensive evaluation, including close-ended QA and open-ended tasks (both GPT-assisted evaluation and our novel LLM-free framework).

Experimental Results

Results on FAVOR-Bench


The overall performance of 21 MLLMs on FAVOR-Bench, including close-ended multiple-choice and open-ended evaluation with GPT-assisted and LLM-free scores. GPT-C and GPT-D denote the correctness and detailedness scores generated by GPT-4o. The highest and second-highest results among all MLLMs are indicated in bold and underlined, respectively. Due to API response limitations, the video input of proprietary MLLMs is restricted to 16 frames if the video is longer than 16 seconds (denoted as "1 fps*"). Tarsier2-Recap-7B is a model specially designed for captioning, and it cannot complete the close-ended evaluation.
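As a reading aid, the "1 fps*" setting can be sketched as the sampling rule below. The function name and the uniform spread of the capped frames are assumptions for this example; the exact frame-selection strategy of the proprietary APIs is not specified here.

# Hypothetical sketch of the "1 fps*" rule: sample one frame per second, but
# cap the input at 16 frames when a video exceeds 16 seconds. Distributing the
# capped frames uniformly over the clip is an assumption made for this sketch.
def sample_timestamps(duration_s: float, max_frames: int = 16) -> list:
    """Return the timestamps (in seconds) at which frames would be taken."""
    n = max(1, min(int(duration_s), max_frames))  # 1 fps, capped at max_frames
    # Spread the frames uniformly over the whole clip (assumed behavior).
    return [duration_s * (i + 0.5) / n for i in range(n)]


print(len(sample_timestamps(10.0)))  # 10 frames for a 10-second clip
print(len(sample_timestamps(60.0)))  # 16 frames for a 60-second clip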

Fine-tuning with FAVOR-Train

Comparison on TVBench and MotionBench after fine-tuning with our proposed FAVOR-Train. "AVG" denotes the average score over all 10 tasks of TVBench. "ALL" denotes the accuracy on all 4,018 questions of MotionBench-Dev. Qwen2.5-VL gains considerable performance improvements from fine-tuning with FAVOR-Train.

Question-Answer Examples

Citation

@article{tu2025favor,
  title={FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding},
  author={Tu, Chongjun and Zhang, Lin and Chen, Pengtao and Ye, Peng and Zeng, Xianfang and Cheng, Wei and Yu, Gang and Chen, Tao},
  journal={arXiv preprint arXiv:2503.14935},
  year={2025}
}
 