Download with CLI:
huggingface-cli download anonymousdb/AIGVE-60K --repo-type dataset --local-dir ./AIGVE-60K
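If you prefer Python, a roughly equivalent download using the huggingface_hub library (assuming it is installed, e.g. `pip install huggingface_hub`):

```python
# Python equivalent of the CLI download above, via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="anonymousdb/AIGVE-60K",
    repo_type="dataset",
    local_dir="./AIGVE-60K",
)
```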
This leaderboard presents the performance of 30 T2V generation models on the AIGVE-60K benchmark, evaluated along three key dimensions:
- 🎨 Perceptual Quality
- 🔄 Text-to-Video Correspondence
- ❓ Task-specific Accuracy
The final Overall Rank is computed by summing each model's ranks across the three dimensions; a minimal sketch of this rank-sum computation follows the table. The top 3 models are marked with 🥇🥈🥉.
🏆 Overall Rank | Model | 🎨 Perception MOS | 🔗 Rank | 🔄 Correspondence MOS | 🔗 Rank | ❓ Task-specific Accuracy (%) | 🔗 Rank |
---|---|---|---|---|---|---|---|
1 | 🥇 Pixverse | 63.81 | 2 | 59.97 | 2 | 91.33 | 1 |
2 | 🥈 Wanxiang | 60.54 | 7 | 60.37 | 1 | 90.33 | 2 |
3 | 🥉 Hailuo | 60.58 | 5 | 59.74 | 3 | 87.67 | 3 |
4 | Jimeng | 65.25 | 1 | 57.86 | 6 | 81.33 | 6 |
5 | Sora | 62.09 | 4 | 59.68 | 4 | 85.67 | 5 |
6 | Hunyuan | 58.81 | 9 | 57.25 | 7 | 79.67 | 7 |
7 | Vidu1.5 | 54.56 | 15 | 58.25 | 5 | 87.00 | 4 |
8 | Gen3 | 59.22 | 8 | 55.72 | 8 | 75.33 | 9 |
9 | Kling | 60.56 | 6 | 55.57 | 9 | 73.67 | 11 |
10 | Genmo | 57.66 | 11 | 53.78 | 11 | 75.67 | 8 |
11 | ChatGLM | 56.39 | 13 | 53.98 | 10 | 74.00 | 10 |
12 | Xunfei | 58.60 | 10 | 53.46 | 12 | 66.33 | 12 |
13 | Pyramid | 63.67 | 3 | 50.17 | 16 | 50.17 | 22 |
14 | Wan2.1 | 57.27 | 12 | 52.33 | 13 | 62.67 | 16 |
15 | Allegro | 56.08 | 14 | 50.70 | 15 | 63.00 | 15 |
16 | VideoCrafter2 | 48.11 | 19 | 51.07 | 14 | 65.67 | 13 |
17 | CogVideo X1.5 | 50.59 | 16 | 49.73 | 17 | 64.67 | 14 |
18 | Animate | 50.48 | 17 | 49.30 | 18 | 60.67 | 17 |
19 | Lavie | 49.30 | 18 | 48.22 | 19 | 55.00 | 20 |
20 | Hotshot-XL | 42.66 | 22 | 47.75 | 20 | 57.67 | 18 |
21 | Latte | 43.81 | 21 | 46.73 | 22 | 54.33 | 21 |
22 | VideoCrafter1 | 44.12 | 20 | 44.67 | 24 | 46.00 | 25 |
23 | Text2Video-Zero | 40.53 | 24 | 44.89 | 23 | 48.67 | 23 |
24 | NOVA | 41.18 | 23 | 47.18 | 21 | 56.00 | 19 |
25 | ModelScope | 38.00 | 26 | 43.73 | 25 | 47.33 | 24 |
26 | Tune-A-Video | 35.41 | 27 | 42.69 | 26 | 43.00 | 26 |
27 | LTX | 40.11 | 25 | 41.28 | 28 | 37.00 | 28 |
28 | LVDM | 33.84 | 28 | 42.20 | 27 | 40.33 | 27 |
29 | ZeroScope | 30.08 | 29 | 34.69 | 29 | 22.00 | 29 |
30 | LWM | 27.39 | 30 | 31.49 | 30 | 9.00 | 30 |
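As a concrete illustration, here is a minimal sketch of the rank-sum aggregation over a three-model subset of the table above. It assumes plain per-dimension ranking; it is not the official scoring script, and tie-breaking may differ.

```python
# Minimal sketch: rank each dimension separately, then sort by rank sum.
models = {
    # model: (Perception MOS, Correspondence MOS, Task-specific Accuracy %)
    "Pixverse": (63.81, 59.97, 91.33),
    "Wanxiang": (60.54, 60.37, 90.33),
    "Jimeng": (65.25, 57.86, 81.33),
}

def ranks(scores):
    """Rank of each score within its dimension (1 = best; higher is better)."""
    order = sorted(scores, reverse=True)
    return [order.index(s) + 1 for s in scores]

names = list(models)
dim_ranks = [ranks([models[m][d] for m in names]) for d in range(3)]
rank_sum = {m: sum(rs) for m, rs in zip(names, zip(*dim_ranks))}
for i, m in enumerate(sorted(names, key=rank_sum.get), start=1):
    print(i, m, rank_sum[m])  # overall rank, model, summed rank
```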
Clone the repository:
git clone https://github.com/IntMeGroup/LOVE.git
Create and activate a conda environment:
conda create -n LOVE python=3.9 -y
conda activate LOVE
Install dependencies:
pip install -r requirements.txt
Install flash-attn==2.3.6 (pre-built):
pip install flash-attn==2.3.6 --no-build-isolation
Or compile from source:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install
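Either way, a quick sanity check that the install succeeded (this assumes the package exposes `__version__`, as recent flash-attn releases do):

```python
# Verify that flash-attn is importable and the expected version.
import flash_attn
print(flash_attn.__version__)  # expect 2.3.6
```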
Download and unzip the AIGVE-60K data:
huggingface-cli download anonymousdb/AIGVE-60K data.zip --repo-type dataset --local-dir ./
unzip data.zip -d ./data
Download the InternVL3-9B base model:
huggingface-cli download OpenGVLab/InternVL3-9B --local-dir OpenGVLab/InternVL3-9B
Download the pretrained temporal weights:
huggingface-cli download anonymousdb/LOVE-pretrain temporal.pth --local-dir ./
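If the CLI is unavailable, a hedged Python equivalent of the single-file download above, using `hf_hub_download` from huggingface_hub:

```python
# Download a single file (temporal.pth) from the pretrain repo.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="anonymousdb/LOVE-pretrain",
    filename="temporal.pth",
    local_dir="./",
)
```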
Run stage-1 training, stage-2 training, and QA training:
sh shell/st1_train.sh
sh shell/st2_train.sh
sh shell/train_qa.sh
Download the released weights for evaluation:
huggingface-cli download anonymousdb/LOVE-Perception --local-dir ./weights/stage2/stage2_mos1
huggingface-cli download anonymousdb/LOVE-Correspondence --local-dir ./weights/stage2/stage2_mos2
huggingface-cli download anonymousdb/LOVE-QA --local-dir ./weights/qa
Run the score and QA evaluations:
sh shell/eval_score.sh
sh shell/eval_qa.sh
For inference, first download the perception and correspondence weights:
huggingface-cli download anonymousdb/LOVE-Perception --local-dir ./weights/stage2/stage2_mos1
huggingface-cli download anonymousdb/LOVE-Correspondence --local-dir ./weights/stage2/stage2_mos2
Then configure the inference data files:
1. Update /data/infer_perception.json with the correct path:
   "root": your_path_to_videos
2. Alternatively, to run inference only on the videos listed in video_names.txt, update /data/infer_perception2.json with the correct paths:
   "root": your_path_to_videos
   "video_name_txt": video_names.txt
   and change line 30 of shell/infer_perception.sh to data/infer_perception2.json.
3. Update /data/infer_correspondence.json with the correct paths:
   "root": your_path_to_videos
   "video_name_txt": "video_names.txt",
   "prompt_txt": "prompt.txt",
A hypothetical helper for these JSON edits is sketched below.
Change line 27 of shell/infer_perception.sh to the path of your downloaded pretrained weights, then run:
sh shell/infer_perception.sh
Change line 27 of shell/infer_correspondence.sh to the path of your downloaded pretrained weights, then run:
sh shell/infer_correspondence.sh
This leaderboard presents the performance of 48 V2T interpretation models on the AIGVE-60K benchmark, evaluated along three key dimensions:
- 🎨 Perception SRCC: Spearman rank correlation between predicted and human perceptual-quality scores (a minimal SRCC sketch follows this list).
- 🔄 Correspondence SRCC: Spearman rank correlation for text-to-video alignment.
- ❓ QA Accuracy: accuracy on task-specific question answering.
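To make the metric concrete, here is a minimal sketch of the SRCC computation using scipy.stats.spearmanr; the score lists are illustrative, not benchmark data.

```python
# SRCC between human ratings and a model's predicted scores.
from scipy.stats import spearmanr

human_mos = [63.81, 60.54, 60.58, 65.25, 62.09]  # human perception MOS
model_pred = [0.71, 0.64, 0.66, 0.73, 0.69]      # predicted scores

srcc, p_value = spearmanr(human_mos, model_pred)
print(f"Perception SRCC = {srcc:.4f}")
```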
The final Overall Rank is computed by summing each method's ranks across the three dimensions. The top 3 methods are marked with 🥇🥈🥉.
🏆 Rank | Method | Perception SRCC | 🔗 Rank | Correspondence SRCC | 🔗 Rank | QA Acc (%) | 🔗 Rank |
---|---|---|---|---|---|---|---|
🥇 | LOVE (Ours) | 0.7932 | 1 | 0.7466 | 1 | 78.69 | 1 |
🥈 | InternVL2.5 (38B) | 0.6227 | 6 | 0.6470 | 5 | 75.81 | 3 |
🥉 | Grok2 Vision | 0.5628 | 11 | 0.6659 | 2 | 76.51 | 2 |
4 | InternVL2.5 (72B) | 0.5383 | 16 | 0.6612 | 4 | 75.18 | 4 |
5 | InternVL3 (72B) | 0.5441 | 13 | 0.6314 | 6 | 74.59 | 6 |
6 | ChatGPT-4o | 0.5263 | 18 | 0.6639 | 3 | 74.84 | 5 |
7 | Gemini1.5-pro | 0.4972 | 21 | 0.6095 | 8 | 73.38 | 9 |
7 | InternVL3 (38B) | 0.4950 | 22 | 0.5996 | 9 | 73.89 | 7 |
9 | Llava-one-vision (72B) | 0.5291 | 17 | 0.5702 | 12 | 73.31 | 10 |
10 | Qwen2.5-VL (72B) | 0.4245 | 26 | 0.6272 | 7 | 73.83 | 8 |
11 | Claude3.5 | 0.4267 | 25 | 0.5827 | 11 | 73.20 | 11 |
12 | Qwen2-VL (72B) | 0.4628 | 24 | 0.5598 | 13 | 73.12 | 12 |
12 | FGA-BLIP2 | 0.5181 | 19 | 0.5962 | 10 | 67.06 | 20 |
14 | HPSv2 | 0.5415 | 14 | 0.4989 | 17 | 67.68 | 19 |
15 | FAST-VQA | 0.6391 | 5 | 0.3919 | 29 | 66.27 | 22 |
15 | HOSA | 0.6474 | 3 | 0.4153 | 24 | 64.34 | 29 |
17 | ImageReward | 0.4180 | 27 | 0.5076 | 16 | 68.33 | 18 |
18 | QAC | 0.5958 | 7 | 0.3948 | 27 | 64.40 | 28 |
19 | NIQE | 0.6536 | 2 | 0.4345 | 22 | 62.21 | 39 |
20 | BRISQUE | 0.5843 | 8 | 0.3806 | 30 | 64.67 | 26 |
21 | AestheticScore | 0.5524 | 12 | 0.3931 | 28 | 64.87 | 25 |
22 | Qwen2-VL (7B) | 0.3568 | 32 | 0.4498 | 21 | 71.56 | 13 |
23 | Qwen2.5-VL (7B) | 0.5410 | 15 | 0.5110 | 15 | 62.34 | 37 |
24 | VideoLlama3 (8B) | 0.3922 | 30 | 0.4228 | 23 | 70.16 | 16 |
25 | DOVER | 0.6414 | 4 | 0.3759 | 31 | 62.61 | 35 |
26 | BMPRI | 0.5741 | 9 | 0.3618 | 32 | 64.00 | 30 |
27 | V-Aesthetic Quality | 0.5031 | 20 | 0.4033 | 26 | 64.54 | 27 |
28 | LLaVA-NeXT (8B) | 0.4888 | 23 | 0.2847 | 36 | 70.21 | 15 |
29 | InternVideo2.5 (8B) | 0.1563 | 43 | 0.4978 | 18 | 70.64 | 14 |
30 | InternVL2.5 (8B) | 0.2799 | 38 | 0.4856 | 19 | 66.30 | 21 |
31 | mPLUG-Owl3 (7B) | 0.3532 | 34 | 0.5478 | 14 | 63.02 | 34 |
31 | InternVL3 (9B) | 0.2731 | 39 | 0.4768 | 20 | 65.82 | 23 |
33 | SimpleVQA | 0.5631 | 10 | 0.3474 | 33 | 60.78 | 42 |
34 | PickScore | 0.4026 | 29 | 0.4135 | 25 | 62.29 | 38 |
35 | VideoLlava (7B) | 0.1809 | 41 | 0.2005 | 41 | 68.46 | 17 |
36 | V-Temporal Flickering | 0.4076 | 28 | 0.1958 | 42 | 63.69 | 32 |
37 | BLIPScore | 0.1884 | 40 | 0.3163 | 34 | 63.93 | 31 |
38 | BPRI | 0.3558 | 33 | 0.2018 | 40 | 63.56 | 33 |
39 | VSFA | 0.3750 | 31 | 0.2438 | 37 | 57.09 | 46 |
40 | CogAgent (18B) | 0.1244 | 45 | 0.1190 | 46 | 65.32 | 24 |
41 | V-Subject Consistency | 0.3443 | 35 | 0.1647 | 45 | 62.52 | 36 |
42 | BVQA | 0.3089 | 36 | 0.2379 | 38 | 58.47 | 44 |
43 | V-Overall Consistency | 0.1559 | 44 | 0.3076 | 35 | 61.96 | 41 |
44 | V-Imaging Quality | 0.2810 | 37 | 0.1952 | 43 | 60.60 | 43 |
45 | CLIPScore | 0.0947 | 46 | 0.2290 | 39 | 58.27 | 45 |
46 | VQAScore | 0.1677 | 42 | 0.1763 | 44 | 52.97 | 47 |
47 | Llama3.2-Vision (11B) | 0.0940 | 47 | 0.0804 | 47 | 62.19 | 40 |
48 | DeepseekVL2 (1B) | 0.0121 | 48 | 0.0173 | 48 | 39.29 | 48 |
🏆 Rank | Method | Perception SRCC | 🔗 Rank | Correspondence SRCC | 🔗 Rank | QA Acc | 🔗 Rank |
---|---|---|---|---|---|---|---|
🥇 | LOVE (Ours) | 0.9324 | 1 | 0.9778 | 1 | 0.98 | 1 |
🥈 | InternVL2.5 (38B) | 0.9052 | 3 | 0.9586 | 2 | 0.95 | 6 |
🥉 | InternVL3 (72B) | 0.8923 | 7 | 0.9444 | 8 | 0.96 | 2 |
4 | Grok2 Vision | 0.8808 | 10 | 0.9546 | 4 | 0.95 | 5 |
5 | InternVL2.5 (72B) | 0.8843 | 9 | 0.9542 | 5 | 0.94 | 7 |
6 | FGA-BLIP2 | 0.8954 | 5 | 0.9502 | 6 | 0.94 | 10 |
7 | ChatGPT-4o | 0.9048 | 4 | 0.9458 | 7 | 0.93 | 11 |
8 | Gemini1.5-pro | 0.8790 | 11 | 0.9430 | 10 | 0.95 | 4 |
9 | VideoLlama3 (8B) | 0.9073 | 2 | 0.9075 | 16 | 0.82 | 19 |
10 | InternVL3 (38B) | 0.8118 | 20 | 0.9439 | 9 | 0.94 | 9 |
11 | Qwen2.5-VL (72B) | 0.7762 | 28 | 0.9364 | 13 | 0.95 | 3 |
12 | Qwen2-VL (72B) | 0.8388 | 16 | 0.9271 | 15 | 0.91 | 14 |
13 | FAST-VQA | 0.8945 | 6 | 0.8376 | 20 | 0.81 | 20 |
14 | LLaVA-NeXT (8B) | 0.8785 | 12 | 0.8042 | 23 | 0.92 | 12 |
15 | mPLUG-Owl3 (7B) | 0.7962 | 24 | 0.9310 | 14 | 0.89 | 15 |
16 | InternVL3 (9B) | 0.8300 | 17 | 0.9373 | 12 | 0.77 | 25 |
17 | Claude3.5 | 0.7602 | 30 | 0.8919 | 17 | 0.94 | 8 |
18 | InternVL2.5 (8B) | 0.7882 | 25 | 0.9390 | 11 | 0.81 | 21 |
19 | Llava-one-vision (72B) | 0.7829 | 27 | 0.8741 | 18 | 0.91 | 13 |
20 | DOVER | 0.8874 | 8 | 0.8038 | 24 | 0.77 | 26 |
21 | ImageReward | 0.8016 | 23 | 0.8549 | 19 | 0.86 | 17 |
22 | InternVideo2.5 (8B) | 0.3361 | 44 | 0.9560 | 3 | 0.84 | 18 |
23 | NIQE | 0.8412 | 15 | 0.7838 | 26 | 0.76 | 27 |
24 | PickScore | 0.8198 | 18 | 0.7775 | 28 | 0.78 | 22 |
25 | HOSA | 0.8456 | 14 | 0.7780 | 27 | 0.76 | 28 |
26 | Qwen2.5-VL (7B) | 0.8652 | 13 | 0.8167 | 22 | 0.67 | 36 |
27 | Qwen2-VL (7B) | 0.7085 | 33 | 0.7953 | 25 | 0.87 | 16 |
28 | QAC | 0.8100 | 21 | 0.7717 | 29 | 0.75 | 30 |
29 | BRISQUE | 0.8131 | 19 | 0.7615 | 30 | 0.74 | 31 |
30 | HPSv2 | 0.7504 | 32 | 0.7522 | 31 | 0.78 | 23 |
31 | CogAgent (18B) | 0.4834 | 41 | 0.8198 | 21 | 0.78 | 24 |
32 | SimpleVQA | 0.8038 | 22 | 0.7273 | 33 | 0.69 | 34 |
33 | BMPRI | 0.7878 | 26 | 0.7321 | 32 | 0.70 | 33 |
34 | V-Aesthetic Quality | 0.7740 | 29 | 0.7273 | 33 | 0.70 | 32 |
35 | AestheticScore | 0.7566 | 31 | 0.7001 | 35 | 0.67 | 35 |
36 | VideoLlava (7B) | 0.6125 | 37 | 0.6406 | 36 | 0.75 | 29 |
37 | V-Temporal Flickering | 0.6396 | 34 | 0.5778 | 38 | 0.58 | 38 |
38 | VSFA | 0.6227 | 36 | 0.5858 | 37 | 0.52 | 39 |
39 | BPRI | 0.6356 | 35 | 0.5324 | 39 | 0.47 | 41 |
40 | BVQA | 0.5030 | 39 | 0.4674 | 41 | 0.48 | 40 |
41 | V-Imaging Quality | 0.5426 | 38 | 0.4986 | 40 | 0.44 | 43 |
42 | V-Subject Consistency | 0.4839 | 40 | 0.4416 | 42 | 0.45 | 42 |
43 | Llama3.2-Vision (11B) | 0.4483 | 42 | 0.2783 | 46 | 0.60 | 37 |
44 | VQAScore | 0.3437 | 43 | 0.3922 | 43 | 0.33 | 46 |
45 | BLIPScore | 0.2111 | 45 | 0.3451 | 44 | 0.38 | 44 |
46 | V-Overall Consistency | 0.1742 | 46 | 0.3201 | 45 | 0.34 | 45 |
47 | CLIPScore | 0.0300 | 48 | 0.1408 | 47 | 0.17 | 47 |
48 | DeepseekVL2 (1B) | 0.0607 | 47 | 0.0785 | 48 | 0.10 | 48 |
This section lists 30 representative T2V generation models, including both commercial closed-source models and open-source lab models, with links to their official or GitHub pages.
Model | URL |
---|---|
Pixverse | https://pixverse.ai/ |
Wanxiang | https://tongyi.aliyun.com/wanxiang/ |
Hailuo | https://hailuoai.video/ |
Jimeng | https://jimeng.jianying.com/ |
Sora | https://openai.com/research/video-generation-models-as-world-simulators |
Hunyuan | https://aivideo.hunyuan.tencent.com/ |
Vidu1.5 | https://www.vidu.studio/zh |
Gen3 | https://runwayml.com/research/introducing-gen-3-alpha |
Kling | https://klingai.io/ |
Genmo | https://www.genmo.ai |
ChatGLM | https://chatglm.cn/video?lang=zh |
Xunfei | https://typemovie.art/ |
Model | URL |
---|---|
Pyramid | https://github.com/jy0205/Pyramid-Flow |
Wan2.1 | https://github.com/Wan-Video/Wan2.1 |
Allegro | https://github.com/rhymes-ai/Allegro |
VideoCrafter2 | https://github.com/AILab-CVC/VideoCrafter |
CogVideo X1.5 | https://github.com/THUDM/CogVideo |
Animate | https://github.com/aigc-apps/EasyAnimate |
Lavie | https://github.com/Vchitect/LaVie |
Hotshot-XL | https://github.com/hotshotco/Hotshot-XL |
Latte | https://github.com/Vchitect/Latte |
VideoCrafter1 | https://github.com/AILab-CVC/VideoCrafter |
Text2Video-Zero | https://github.com/Picsart-AI-Research/Text2Video-Zero |
NOVA | https://github.com/baaivision/NOVA |
ModelScope | https://github.com/modelscope/modelscope |
Tune-A-Video | https://github.com/showlab/Tune-A-Video |
LTX | https://github.com/Lightricks/LTX-Video |
LVDM | https://github.com/YingqingHe/LVDM |
ZeroScope | https://huggingface.co/cerspense/zeroscope_v2_XL |
LWM | https://github.com/LargeWorldModel/LWM |
This repository provides a comprehensive list of **Vision-to-Text (V2T) interpretation models**, covering conventional video quality assessment (VQA) models, learning-based image-text alignment models, large multimodal models (LMMs), and proprietary foundation models. Each method is annotated with its category and a corresponding GitHub or official URL.
Categories:
- ♣️ Classical VQA Models
- ❤️ Learning-based Scoring Models
- ⭐ Large Multimodal Models (LMMs)
- 🔺 Proprietary Foundation Models (Closed-source)
Conventional VQA Metrics
Run:
mos15.m
For BMPRI, BPRI, BRISQUE, HOSA, NIQE, and QAC, run:
videobench.m
Category | Method | URL |
---|---|---|
♣️ | VSFA | GitHub |
♣️ | BVQA | GitHub |
♣️ | SimpleVQA | GitHub |
♣️ | FAST-VQA | GitHub |
♣️ | DOVER | GitHub |
❤️ | CLIPScore | GitHub |
❤️ | BLIPScore | GitHub |
❤️ | AestheticScore | GitHub |
❤️ | ImageReward | GitHub |
❤️ | PickScore | GitHub |
❤️ | HPSv2 | GitHub |
❤️ | VQAScore | GitHub |
❤️ | FGA-BLIP2 | GitHub |
⭐ | DeepSeek-VL2 | GitHub |
⭐ | Video-LLaVA | GitHub |
⭐ | VideoLLaMA3 | GitHub |
⭐ | mPLUG-OWL3 | GitHub |
⭐ | Qwen2.5-VL | GitHub |
⭐ | LLaMA-3.2-Vision | HuggingFace |
⭐ | CogAgent | GitHub |
⭐ | LLaVA-NeXT | GitHub |
⭐ | InternVideo2.5 | GitHub |
⭐ | InternVL | GitHub |
🔺 | Gemini 1.5 Pro | Official |
🔺 | Claude 3.5 | Official |
🔺 | Grok2 Vision | Official |
🔺 | ChatGPT-4o | Official |
The current rankings are based on data we obtained from randomly selected professional annotators, and no offense is intended toward the developers of these excellent T2V and V2T models. Although our model shows promising scalability in evaluating AIGVs generated from new prompts and by previously unseen T2V models, its effectiveness in real-world applications remains an open question.
We hope that our benchmark and dataset will contribute to the advancement of:
- 🎨 T2V Generation
- 📊 T2V Evaluation
- 🔁 V2T Interpretation
Thanks to the original authors of all the models listed here. This curated list is intended to help researchers and developers in the T2V generation, V2T interpretation, and multimodal quality assessment communities.
- ✅ Release the training code
- ✅ Release the evaluation code
- ✅ Release the AIGVE-60K Database