XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

XModBench is a comprehensive benchmark designed to evaluate the cross-modal capabilities and consistency of omni-language models. It systematically assesses model performance across multiple modalities (text, vision, audio) and various cognitive tasks, revealing critical gaps in current state-of-the-art models.

Key Features

🎯 Multi-Modal Evaluation: Comprehensive testing across text, vision, and audio modalities
🧩 5 Task Dimensions: Perception, Spatial, Temporal, Linguistic, and Knowledge tasks
📊 13 SOTA Models Evaluated: Including Gemini 2.5 Pro, Qwen2.5-Omni, EchoInk-R1, and more
🔄 Consistency Analysis: Measures performance stability across different modal configurations
👥 Human Performance Baseline: Establishes human-level benchmarks for comparison

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/XingruiWang/XModBench.git
cd XModBench
# Install dependencies
pip install -r requirements.txt

📂 Dataset Structure

XModBench/
├── data/
│   ├── text/
│   │   ├── perception/
│   │   ├── spatial/
│   │   ├── temporal/
│   │   ├── linguistic/
│   │   └── knowledge/
│   ├── vision/
│   │   └── [same task categories]
│   └── audio/
│       └── [same task categories]
├── models/
│   └── evaluation_scripts/
├── results/
│   └── model_performances/
└── analysis/
    └── visualization/

Basic Usage

#!/bin/bash
#SBATCH --job-name=VLM_eval        
#SBATCH --output=log/job_%j.out
#SBATCH --error=log/job_%j.log                        
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
echo "Running on host: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
module load conda
# conda activate vlm
conda activate omni
export audioBench='/home/xwang378/scratch/2025/AudioBench'
# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_audio_vision \
#     --sample 1000
# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_vision_audio \
#     --sample 1000
# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_vision_text \
#     --sample 1000
# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_audio_text \
#     --sample 1000
# Qwen2.5-Omni
# python $audioBench/scripts/run.py \
#         --model qwen2.5_omni \
#         --task_name perception/vggss_audio_text \
#         --sample 1000
python $audioBench/scripts/run.py \
        --model qwen2.5_omni \
        --task_name perception/vggss_vision_text \
        --sample 1000

📈 Benchmark Results

Overall Performance Comparison

Model	Perception	Spatial	Temporal	Linguistic	Knowledge	Average
Gemini 2.5 Pro	75.9%	50.1%	60.8%	76.8%	89.3%	70.6%
Human Performance	91.0%	89.7%	88.9%	93.9%	93.9%	91.5%

Key Findings

1️⃣ Task Competence Gaps

Strong Performance: Perception and linguistic tasks (~75% for best models)
Weak Performance: Spatial (50.1%) and temporal reasoning (60.8%)
Performance Drop: 15-25 points decrease in spatial/temporal vs. perception tasks

2️⃣ Modality Disparity

Audio vs. Text: 20-49 point performance drop
Audio vs. Vision: 33-point average gap
Vision vs. Text: ~15-point disparity
Consistency: Best models show 10-12 point standard deviation

3️⃣ Directional Imbalance

Vision↔Text: 9-17 point gaps between directions
Audio↔Text: 6-8 point asymmetries
Root Cause: Training data imbalance favoring image-to-text over inverse directions

📝 Citation

If you use XModBench in your research, please cite our paper:

@article{wang2024xmodbench,
  title={XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models},
  author={Wang, Xingrui, etc.},
  journal={arXiv preprint arXiv:2510.15148},
  year={2024}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

We thank all contributors and the research community for their valuable feedback and suggestions.

📧 Contact

Project Lead: Xingrui Wang
Email: [xwang378@jhu.edu]
Website: https://xingruiwang.github.io/projects/XModBench/

🔗 Links

Todo

Release Huggingface data
Release data processing code
Release data evaluation code

Note: XModBench is actively maintained and regularly updated with new models and evaluation metrics. For the latest updates, please check our releases page.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.gradio		.gradio
benchmark		benchmark
models		models
scripts		scripts
user_results		user_results
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md
app.py		app.py
download.slurm		download.slurm
multiuser_app.py		multiuser_app.py
multiuser_app_vN.py		multiuser_app_vN.py
result.tex		result.tex
run.slurm		run.slurm
run_api.slurm		run_api.slurm
tmp_upload_hf.py		tmp_upload_hf.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Key Features

🚀 Quick Start

Installation

📂 Dataset Structure

Basic Usage

📈 Benchmark Results

Overall Performance Comparison

Key Findings

1️⃣ Task Competence Gaps

2️⃣ Modality Disparity

3️⃣ Directional Imbalance

📝 Citation

📄 License

🙏 Acknowledgments

📧 Contact

🔗 Links

Todo

About

Uh oh!

Languages

XingruiWang/XModBench

Folders and files

Latest commit

History

Repository files navigation

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Key Features

🚀 Quick Start

Installation

📂 Dataset Structure

Basic Usage

📈 Benchmark Results

Overall Performance Comparison

Key Findings

1️⃣ Task Competence Gaps

2️⃣ Modality Disparity

3️⃣ Directional Imbalance

📝 Citation

📄 License

🙏 Acknowledgments

📧 Contact

🔗 Links

Todo

About

Resources

Uh oh!

Stars

Watchers

Forks

Languages