XModBench is a comprehensive benchmark designed to evaluate the cross-modal capabilities and consistency of omni-language models. It systematically assesses model performance across multiple modalities (text, vision, audio) and various cognitive tasks, revealing critical gaps in current state-of-the-art models.
- π― Multi-Modal Evaluation: Comprehensive testing across text, vision, and audio modalities
- π§© 5 Task Dimensions: Perception, Spatial, Temporal, Linguistic, and Knowledge tasks
- π 13 SOTA Models Evaluated: Including Gemini 2.5 Pro, Qwen2.5-Omni, EchoInk-R1, and more
- π Consistency Analysis: Measures performance stability across different modal configurations
- π₯ Human Performance Baseline: Establishes human-level benchmarks for comparison
# Clone the repository
git clone https://github.com/XingruiWang/XModBench.git
cd XModBench
# Install dependencies
pip install -r requirements.txtXModBench/
βββ data/
β βββ text/
β β βββ perception/
β β βββ spatial/
β β βββ temporal/
β β βββ linguistic/
β β βββ knowledge/
β βββ vision/
β β βββ [same task categories]
β βββ audio/
β βββ [same task categories]
βββ models/
β βββ evaluation_scripts/
βββ results/
β βββ model_performances/
βββ analysis/
βββ visualization/
#!/bin/bash
#SBATCH --job-name=VLM_eval
#SBATCH --output=log/job_%j.out
#SBATCH --error=log/job_%j.log
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
echo "Running on host: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
module load conda
# conda activate vlm
conda activate omni
export audioBench='/home/xwang378/scratch/2025/AudioBench'
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_audio_vision \
# --sample 1000
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_vision_audio \
# --sample 1000
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_vision_text \
# --sample 1000
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_audio_text \
# --sample 1000
# Qwen2.5-Omni
# python $audioBench/scripts/run.py \
# --model qwen2.5_omni \
# --task_name perception/vggss_audio_text \
# --sample 1000
python $audioBench/scripts/run.py \
--model qwen2.5_omni \
--task_name perception/vggss_vision_text \
--sample 1000
| Model | Perception | Spatial | Temporal | Linguistic | Knowledge | Average |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 75.9% | 50.1% | 60.8% | 76.8% | 89.3% | 70.6% |
| Human Performance | 91.0% | 89.7% | 88.9% | 93.9% | 93.9% | 91.5% |
- Strong Performance: Perception and linguistic tasks (~75% for best models)
- Weak Performance: Spatial (50.1%) and temporal reasoning (60.8%)
- Performance Drop: 15-25 points decrease in spatial/temporal vs. perception tasks
- Audio vs. Text: 20-49 point performance drop
- Audio vs. Vision: 33-point average gap
- Vision vs. Text: ~15-point disparity
- Consistency: Best models show 10-12 point standard deviation
- VisionβText: 9-17 point gaps between directions
- AudioβText: 6-8 point asymmetries
- Root Cause: Training data imbalance favoring image-to-text over inverse directions
If you use XModBench in your research, please cite our paper:
@article{wang2024xmodbench,
title={XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models},
author={Wang, Xingrui, etc.},
journal={arXiv preprint arXiv:2510.15148},
year={2024}
}This project is licensed under the MIT License - see the LICENSE file for details.
We thank all contributors and the research community for their valuable feedback and suggestions.
- Project Lead: Xingrui Wang
- Email: [xwang378@jhu.edu]
- Website: https://xingruiwang.github.io/projects/XModBench/
- Release Huggingface data
- Release data processing code
- Release data evaluation code
Note: XModBench is actively maintained and regularly updated with new models and evaluation metrics. For the latest updates, please check our releases page.
