🔔News
🚀[2024-10-20]: MixEval-X is released! Check out the Paper and Leaderboard to learn more about this real-world, any-to-any benchmark!
Introduction
Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
TL;DR: MixEval-X is the first any-to-any, real-world benchmark featuring diverse input-output modalities, real-world task distributions, consistent high standards across modalities, and dynamism. It achieves up to 0.98 correlation with arena-like multi-modal evaluations while being far more efficient.
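The reported agreement (up to 0.98) is a correlation between MixEval-X's model ranking and arena-style human preference rankings. The snippet below is a rough illustration only, not the authors' code: the model names and scores are made up, and the exact correlation statistic used in the paper may differ, but a rank correlation such as Spearman's can be computed like this:

```python
# Minimal sketch: rank correlation between a benchmark ranking and an
# arena-style ranking. All model names and scores here are hypothetical.
from scipy.stats import spearmanr

benchmark_scores = {"model_a": 76.9, "model_b": 76.6, "model_c": 74.2, "model_d": 66.1}
arena_elo        = {"model_a": 1271, "model_b": 1285, "model_c": 1230, "model_d": 1150}

models = sorted(benchmark_scores)           # fix a common model order
x = [benchmark_scores[m] for m in models]   # benchmark scores
y = [arena_elo[m] for m in models]          # arena Elo scores

rho, p_value = spearmanr(x, y)              # Spearman rank correlation
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
```

The same computation applies per modality: the closer the coefficient is to 1, the more the benchmark's ranking agrees with the crowd-sourced ranking.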
MixEval-X
Why Use MixEval-X Benchmarks?
🥇 It extends all the benefits of MixEval to multi-modal evaluations, including a comprehensive and less biased query distribution; fair grading (except for open-ended tasks); dynamism; accurate model ranking; fast, cost-effective, and reproducible execution; and a challenging nature.
🥇 It establishes unified, high standards across modalities and communities. For single-modality models, it ensures their evaluations keep up with state-of-the-art standards; for multi-modality models, it ensures consistent, high-standard evaluations across modalities, preventing any single modality from becoming a bottleneck.
🥇 Beyond model evaluation, MixEval-X benchmarks different organizations (as shown in the first Figure) with balanced dimensions (modalities), unlocking a new level of evaluation.
Statistics
Leaderboards
Leaderboards are provided for eight modality directions: Image2Text, Video2Text, Audio2Text, Text2Image, Text2Video, Text2Audio, Text2Action, and Image2Action.
MixEval-X Image2Text Leaderboard
| Model | Image2Text 🥇 | Image2Text-Hard 🥇 | SEED (Mixed) | MMMU (Mixed) | DocVQA (Mixed) | TextVQA (Mixed) | VizWiz (Mixed) | InfographicVQA (Mixed) | SEED-Hard (Mixed) | MMMU-Hard (Mixed) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 76.9 | 46.2 | 76.0 | 75.1 | 94.6 | 90.3 | 62.5 | 78.8 | 31.0 | 48.9 |
| GPT-4o | 76.6 | 45.8 | 75.6 | 74.1 | 87.4 | 90.9 | 66.9 | 79.0 | 29.3 | 45.9 |
| GPT-4V | 75.0 | 44.6 | 75.6 | 68.0 | 92.1 | 89.3 | 53.7 | 79.2 | 31.9 | 40.6 |
| Qwen2-VL-72B | 74.8 | 43.4 | 71.5 | 67.5 | 90.6 | 90.3 | 66.3 | 80.4 | 25.4 | 27.8 |
| Gemini 1.5 Pro | 74.2 | 42.2 | 72.2 | 77.2 | 85.6 | 86.8 | 63.7 | 76.7 | 29.7 | 44.4 |
| Llama 3.2 90B | 73.0 | 40.6 | 73.3 | 62.9 | 92.7 | 90.9 | 61.6 | 89.8 | 28.9 | 30.1 |
| InternVL2-26B | 71.5 | 41.5 | 71.5 | 55.8 | 90.3 | 91.2 | 58.2 | 70.2 | 32.3 | 28.6 |
| InternVL-Chat-V1.5 | 70.1 | 37.5 | 70.7 | 56.9 | 83.6 | 83.1 | 55.3 | 61.2 | 22.0 | 18.8 |
| Claude 3 Opus | 69.5 | 41.1 | 72.0 | 66.5 | 84.2 | 86.7 | 56.9 | 66.9 | 34.9 | 44.4 |
| Qwen-VL-MAX | 69.2 | 37.5 | 70.0 | 68.5 | 83.1 | 87.2 | 53.1 | 66.1 | 27.6 | 37.6 |
| LLaVA-1.6-34B | 68.1 | 37.5 | 70.4 | 60.4 | 71.0 | 81.8 | 48.6 | 58.8 | 31.9 | 36.8 |
| Claude 3 Sonnet | 67.8 | 38.3 | 71.1 | 50.8 | 86.7 | 80.3 | 58.2 | 78.6 | 32.3 | 30.8 |
| Reka Core | 67.4 | 37.3 | 67.5 | 71.1 | 76.5 | 79.9 | 56.9 | 59.6 | 25.0 | 39.1 |
| Reka Flash | 67.4 | 36.6 | 73.6 | 53.8 | 71.3 | 76.8 | 59.6 | 62.5 | 32.8 | 23.3 |
| InternVL-Chat-V1.2 | 67.2 | 36.0 | 70.7 | 54.8 | 51.8 | 76.3 | 60.0 | 59.2 | 25.4 | 33.8 |
| Qwen-VL-PLUS | 67.0 | 35.9 | 66.2 | 56.9 | 84.1 | 83.1 | 57.5 | 52.7 | 19.8 | 27.1 |
| Claude 3 Haiku | 66.1 | 37.5 | 67.8 | 58.4 | 88.3 | 83.0 | 59.8 | 59.4 | 32.8 | 45.9 |
| Gemini 1.0 Pro | 66.1 | 35.0 | 67.6 | 60.9 | 70.3 | 81.3 | 55.7 | 51.8 | 29.3 | 39.8 |
| InternLM-XComposer2-VL | 62.1 | 33.6 | 66.9 | 40.6 | 54.7 | 74.9 | 56.3 | 46.5 | 28.9 | 24.8 |
| InternVL-Chat-V1.1 | 58.5 | 30.9 | 68.0 | 46.7 | 38.3 | 64.6 | 52.5 | 37.5 | 28.4 | 30.8 |
| Yi-VL-34B | 58.5 | 30.6 | 68.0 | 53.8 | 21.5 | 59.7 | 53.3 | 41.4 | 27.6 | 29.3 |
| OmniLMM-12B | 58.2 | 29.2 | 67.3 | 54.8 | 42.3 | 70.2 | 48.6 | 26.9 | 31.9 | 32.3 |
| DeepSeek-VL-7B-Chat | 56.7 | 26.5 | 61.3 | 41.1 | 39.4 | 69.9 | 50.8 | 32.0 | 21.1 | 14.3 |
| Yi-VL-6B | 55.4 | 30.1 | 65.6 | 45.7 | 23.6 | 62.3 | 52.2 | 28.0 | 27.6 | 19.5 |
| InfiMM-Zephyr-7B | 53.7 | 29.4 | 62.5 | 44.2 | 21.9 | 46.1 | 46.1 | 27.6 | 26.7 | 25.6 |
| CogVLM | 51.5 | 23.7 | 54.4 | 25.4 | 46.4 | 70.5 | 46.5 | 56.1 | 21.6 | 11.3 |
| MiniCPM-V | 51.5 | 25.9 | 59.1 | 32.0 | 53.2 | 76.6 | 40.8 | 32.2 | 23.7 | 18.0 |
| Marco-VL | 50.5 | 24.3 | 56.0 | 37.1 | 48.2 | 58.1 | 37.3 | 40.6 | 19.0 | 27.8 |
| LLaVA-1.5-13B | 50.2 | 26.0 | 56.9 | 32.5 | 22.4 | 53.7 | 42.9 | 24.3 | 19.0 | 24.8 |
| SVIT | 49.9 | 25.4 | 59.1 | 35.5 | 19.9 | 51.2 | 42.9 | 27.8 | 27.6 | 15.8 |
| mPLUG-OWL2 | 48.9 | 22.5 | 57.5 | 28.9 | 26.9 | 59.7 | 39.8 | 29.4 | 28.0 | 10.5 |
| SPHINX | 47.5 | 23.8 | 54.5 | 39.1 | 16.4 | 51.0 | 41.4 | 24.5 | 19.8 | 18.0 |
| InstructBLIP-T5-XXL | 46.2 | 21.5 | 58.0 | 31.0 | 11.2 | 41.7 | 44.3 | 24.5 | 19.4 | 28.6 |
| InstructBLIP-T5-XL | 45.5 | 22.9 | 53.1 | 32.0 | 14.5 | 44.5 | 44.5 | 12.9 | 21.1 | 18.8 |
| BLIP-2 FLAN-T5-XXL | 45.2 | 21.6 | 55.1 | 33.0 | 13.5 | 46.3 | 42.2 | 29.6 | 22.8 | 17.3 |
| BLIP-2 FLAN-T5-XL | 43.0 | 20.0 | 52.5 | 33.5 | 16.3 | 40.9 | 39.2 | 9.4 | 23.3 | 11.3 |
| Adept Fuyu-Heavy | 37.4 | 19.4 | 43.5 | 26.4 | 6.9 | 41.1 | 35.5 | 8.2 | 21.6 | 11.3 |
| LLaMA-Adapter2-7B | 36.6 | 20.4 | 42.5 | 32.5 | 15.6 | 23.7 | 44.5 | 25.1 | 18.1 | 14.3 |
| Otter | 34.1 | 18.5 | 42.5 | 31.5 | 5.3 | 17.9 | 21.2 | 21.4 | 23.3 | 9.8 |
| MiniGPT4-Vicuna-13B | 32.1 | 15.8 | 38.2 | 25.4 | 15.4 | 23.4 | 33.7 | 18.4 | 15.5 | 13.5 |
MixEval-X Video2Text Leaderboard
| Model | Video2Text 🥇 | Video2Text-Hard 🥇 | ActivityNet-QA (Mixed) | HowToQA (Mixed) | TVQA (Mixed) | MSVD-QA (Mixed) | NextQA-freetext (Mixed) | TGIF-QA (Mixed) | ActivityNet-QA-Hard (Mixed) | TVQA-Hard (Mixed) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 74.2 | 45.5 | 73.3 | 76.6 | 64.8 | 79.4 | 76.4 | 78.9 | 60.4 | 39.4 |
| GPT-4o | 72.7 | 38.9 | 64.6 | 78.2 | 74.6 | 80.9 | 70.1 | 78.2 | 32.4 | 48.0 |
| Gemini 1.5 Pro | 71.8 | 38.1 | 65.2 | 64.8 | 82.6 | 82.9 | 74.4 | 75.7 | 43.2 | 68.5 |
| GPT-4V | 71.0 | 40.0 | 63.4 | 78.2 | 69.5 | 77.9 | 69.5 | 78.5 | 37.2 | 37.8 |
| Qwen2-VL-72B | 66.5 | 32.0 | 55.1 | 76.6 | 58.1 | 74.2 | 65.0 | 78.5 | 27.3 | 17.3 |
| Gemini 1.5 Flash | 66.3 | 33.9 | 59.0 | 67.4 | 70.3 | 73.8 | 61.4 | 72.3 | 26.7 | 51.2 |
| LLaVA-OneVision-72B-OV | 64.7 | 32.0 | 56.0 | 77.0 | 64.4 | 71.2 | 64.9 | 70.6 | 35.6 | 28.3 |
| Qwen2-VL-7B | 64.2 | 31.9 | 54.3 | 74.7 | 52.1 | 74.9 | 62.6 | 68.9 | 27.2 | 26.0 |
| LLaVA-Next-Video-34B | 63.1 | 28.4 | 56.1 | 68.6 | 62.7 | 74.0 | 62.8 | 68.0 | 26.7 | 38.6 |
| Claude 3 Haiku | 58.7 | 29.4 | 52.3 | 63.6 | 48.7 | 70.8 | 62.7 | 70.2 | 23.6 | 29.1 |
| LLaVA-Next-Video-7B | 58.7 | 27.2 | 53.2 | 62.1 | 44.5 | 72.5 | 61.0 | 74.4 | 25.9 | 33.1 |
| Reka-edge | 58.7 | 27.3 | 51.7 | 72.4 | 46.6 | 69.1 | 59.3 | 65.2 | 29.0 | 22.8 |
| LLaMA-VID | 55.6 | 23.8 | 52.9 | 60.9 | 36.0 | 72.8 | 61.3 | 67.1 | 19.1 | 17.3 |
| VideoLLaVA | 55.3 | 22.6 | 51.7 | 64.0 | 39.4 | 66.7 | 61.9 | 64.7 | 18.2 | 26.0 |
| Video-ChatGPT | 46.4 | 20.7 | 45.7 | 46.7 | 25.4 | 72.2 | 56.3 | 64.8 | 24.7 | 14.2 |
| mPLUG-video | 39.1 | 17.8 | 41.5 | 36.4 | 23.3 | 71.9 | 56.7 | 61.8 | 22.7 | 7.9 |
MixEval-X Audio2Text Leaderboard
| Model | Audio2Text 🥇 | Audio2Text-Hard 🥇 | Clotho-AQA (Mixed) | DAQA (Mixed) | Clotho-AQA-Hard (Mixed) | DAQA-Hard (Mixed) |
|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | 62.7 | 24.0 | 67.4 | 53.4 | 26.8 | 21.7 |
| Gemini 1.5 Flash | 60.1 | 23.0 | 67.1 | 46.9 | 27.4 | 19.7 |
| Qwen2-Audio-7B-Instruct | 58.8 | 23.5 | 64.7 | 46.0 | 22.5 | 23.5 |
| Qwen2-Audio-7B | 56.6 | 24.6 | 63.1 | 44.0 | 29.9 | 20.0 |
| SALMONN-13B | 52.5 | 20.9 | 57.6 | 41.4 | 14.9 | 25.4 |
| Qwen-Audio | 52.4 | 16.0 | 61.5 | 33.8 | 19.0 | 12.8 |
| Qwen-Audio-Chat | 50.2 | 20.0 | 55.7 | 39.4 | 19.8 | 19.7 |
| SALMONN-7B | 38.9 | 17.1 | 46.6 | 22.2 | 20.6 | 11.6 |
| Pengi | 22.6 | 8.2 | 26.9 | 14.4 | 12.5 | 3.8 |
MixEval-X Text2Image Leaderboard
| Model | Text2Image Elo 🥇 | 95% CI | Text2Image Elo (1st Turn) 🥇 | 95% CI (1st Turn) | Text2Image Elo (2nd Turn) 🥇 | 95% CI (2nd Turn) |
|---|---|---|---|---|---|---|
| Flux | 1054 | -11/15 | 1054 | -20/20 | 1058 | -15/21 |
| DALL·E 3 HD | 1047 | -11/12 | 1062 | -19/19 | 1031 | -17/24 |
| PixArtAlpha | 1037 | -15/14 | 1031 | -18/21 | 1041 | -17/16 |
| PlayGround V2.5 | 1027 | -12/14 | 1027 | -20/26 | 1030 | -24/16 |
| PlayGround V2 | 1023 | -13/12 | 1021 | -22/17 | 1022 | -16/19 |
| SD3 | 993 | -18/12 | 986 | -18/18 | 998 | -18/17 |
| Stable Cascade | 961 | -13/15 | 968 | -24/18 | 956 | -19/25 |
| SD1.5 | 936 | -14/14 | 931 | -16/21 | 940 | -22/22 |
| SDXL | 916 | -13/14 | 918 | -18/18 | 918 | -21/20 |
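The generation-side leaderboards (Text2Image, Text2Video, Text2Audio) rank models by Elo ratings with 95% confidence intervals rather than fixed-answer accuracy. The sketch below is not the official evaluation code; it only illustrates, with hypothetical battle data, how Elo ratings and bootstrap confidence intervals are commonly derived from pairwise preference comparisons in arena-style setups:

```python
# Minimal sketch: Elo ratings plus bootstrap 95% CIs from pairwise "battles".
# The battle records below are hypothetical, not MixEval-X judgment data.
import random
from collections import defaultdict

battles = [  # (winner, loser) pairs from pairwise preference judgments
    ("Flux", "SDXL"), ("DALL-E 3 HD", "SD1.5"), ("Flux", "SD3"),
    ("PixArtAlpha", "Stable Cascade"), ("SD3", "SDXL"), ("Flux", "SD1.5"),
]

def compute_elo(battles, k=32, base=1000):
    """Sequential Elo update over a list of (winner, loser) pairs."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1 - expected_win)
        ratings[loser] -= k * (1 - expected_win)
    return dict(ratings)

def bootstrap_ci(battles, rounds=1000, alpha=0.05):
    """Resample battles with replacement to estimate a 95% CI per model."""
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = random.choices(battles, k=len(battles))
        for model, rating in compute_elo(resampled).items():
            samples[model].append(rating)
    ci = {}
    for model, vals in samples.items():
        vals.sort()
        lo = vals[int(alpha / 2 * len(vals))]
        hi = vals[int((1 - alpha / 2) * len(vals)) - 1]
        ci[model] = (round(lo), round(hi))
    return ci

ratings = compute_elo(battles)
intervals = bootstrap_ci(battles)
for model in sorted(ratings, key=ratings.get, reverse=True):
    print(model, round(ratings[model]), intervals[model])
```

In the tables here, the CI columns report offsets below/above the Elo point estimate (e.g. -11/15); the bootstrap above yields the equivalent absolute interval bounds.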
MixEval-X Text2Video Leaderboard
| Model | Text2Video Elo 🥇 | 95% CI | Text2Video Elo (1st Turn) 🥇 | 95% CI (1st Turn) | Text2Video Elo (2nd Turn) 🥇 | 95% CI (2nd Turn) |
|---|---|---|---|---|---|---|
| HotShot-XL | 1024 | -8/10 | 1024 | -12/14 | 1025 | -12/11 |
| CogVideoX-5B | 1014 | -10/8 | 1020 | -14/12 | 1008 | -11/14 |
| LaVie | 1013 | -9/10 | 1009 | -14/12 | 1017 | -11/14 |
| VideoCrafter2 | 996 | -9/8 | 1002 | -14/12 | 990 | -13/10 |
| ModelScope | 995 | -9/9 | 987 | -13/13 | 1004 | -16/11 |
| ZeroScope V2 | 984 | -10/11 | 972 | -11/10 | 998 | -14/14 |
| Show-1 | 970 | -7/8 | 983 | -12/12 | 955 | -13/12 |
MixEval-X Text2Audio Leaderboard
| Model | Text2Audio Elo 🥇 | 95% CI | Text2Audio Elo (1st Turn) 🥇 | 95% CI (1st Turn) | Text2Audio Elo (2nd Turn) 🥇 | 95% CI (2nd Turn) |
|---|---|---|---|---|---|---|
| AudioLDM 2 | 1034 | -14/18 | 1036 | -19/19 | 1036 | -19/19 |
| Make-An-Audio 2 | 1019 | -14/16 | 1023 | -19/23 | 1012 | -20/32 |
| Stable Audio | 1019 | -14/14 | 1023 | -17/22 | 1018 | -23/19 |
| Tango 2 | 1010 | -16/16 | 995 | -27/17 | 1025 | -27/18 |
| ConsistencyTTA | 1005 | -17/15 | 1005 | -24/24 | 1006 | -22/26 |
| AudioGen | 982 | -13/14 | 978 | -16/23 | 985 | -22/22 |
| Magnet | 926 | -14/16 | 939 | -20/28 | 912 | -16/23 |
MixEval-X Text2Action Leaderboard
MixEval-X Image2Action Leaderboard
Meta-Evaluation
Benchmark Query Distribution
Citation
@article{ni2024mixevalx,
title={MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures},
author={Ni, Jinjie and Song, Yifan and Ghosal, Deepanway and Li, Bo and Zhang, David Junhao and Yue, Xiang and Xue, Fuzhao and Zheng, Zian and Zhang, Kaichen and Shah, Mahir and Jain, Kabir and You, Yang and Shieh, Michael},
journal={arXiv preprint arXiv:2410.13754},
year={2024}
}
@article{ni2024mixeval,
title={MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures},
author={Ni, Jinjie and Xue, Fuzhao and Yue, Xiang and Deng, Yuntian and Shah, Mahir and Jain, Kabir and Neubig, Graham and You, Yang},
journal={arXiv preprint arXiv:2406.06565},
year={2024}
}