XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
A comprehensive tri-modal benchmark for evaluating cross-modal consistency across audio, vision, and text in omni-modal large language models
XModBench contains 60K multiple-choice questions across five task families and systematically covers all six cross-modality directions, enabling diagnosis of task competence, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) suffers from modality disparities, with performance dropping by over 20 points on average when audio inputs replace text, and (iii) exhibits directional imbalance, with a 9-point gap when using vision as context versus using text as context.
Abstract
Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks have advanced multimodal evaluation, it remains unclear whether OLLMs achieve modality-invariant reasoning or inherit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency.
XModBench contains 60K multiple-choice questions across five task families and systematically covers all six cross-modality directions, enabling diagnosis of task competence, modality disparity, and directional imbalance. The findings suggest that OLLMs fall short of modality-invariant reasoning, and XModBench provides a fundamental diagnostic tool for evaluating and improving their overall cross-modal competence.
Benchmark Design
Core Innovation: Modality-Balanced Configuration
The central objective of XModBench is to evaluate whether models preserve cross-modal consistency when the same semantic content appears in different modalities. Each item is a four-choice multiple-choice question consisting of a <context> (question stem) and four <candidates> (answer options).
By systematically permuting text (T), vision (V), and audio (A) across the context and candidates, we generate six modality configurations of the same question:
- Audio → Text (A→T): Audio context, text candidates
- Audio → Vision (A→V): Audio context, visual candidates
- Text → Audio (T→A): Text context, audio candidates
- Text → Vision (T→V): Text context, visual candidates
- Vision → Audio (V→A): Visual context, audio candidates
- Vision → Text (V→T): Visual context, text candidates
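To make the permutation concrete, the following minimal sketch expands one underlying question into all six modality configurations. The item schema, field names, and the expand_configurations helper are illustrative assumptions, not the actual XModBench data format.

```python
from dataclasses import dataclass
from itertools import permutations

# Hypothetical item schema: the same semantic content rendered in all three
# modalities. Field names are illustrative, not the actual XModBench format.
@dataclass
class TriModalItem:
    text: str    # textual description
    vision: str  # path to an image or video clip
    audio: str   # path to an audio clip

MODALITIES = {"T": "text", "V": "vision", "A": "audio"}

def expand_configurations(context_item: TriModalItem, candidates: list[TriModalItem]):
    """Yield the six modality configurations of one underlying question."""
    for ctx, cand in permutations(MODALITIES, 2):  # ordered pairs of 3 modalities = 6 directions
        yield {
            "direction": f"{ctx}->{cand}",
            "context": getattr(context_item, MODALITIES[ctx]),
            "candidates": [getattr(c, MODALITIES[cand]) for c in candidates],
        }
```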
Three Diagnostic Properties
This balanced design enables unprecedented diagnosis of cross-modal consistency through three complementary evaluation dimensions:
- Task Competence: By averaging accuracy across all six modality configurations, we obtain a fair measure of a model's overall capability on each task, independent of modality-specific biases. This reveals which fundamental capabilities models truly possess versus which they fake through modality shortcuts.
- Modality Disparity: By presenting semantically identical questions under different modality configurations, we isolate modality as the only variable. Accuracy differences reveal which modalities models handle best or worst; for example, comparing T→A vs. T→V shows whether models understand audio or vision better when given the same text context.
- Directional Imbalance: By examining inverse settings, i.e., swapping context and candidate modalities (e.g., V→T vs. T→V), we expose asymmetries in cross-modal grounding. Large gaps indicate that models perform better in certain directions due to training data imbalances, rather than achieving true bidirectional understanding.
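To make the three dimensions concrete, here is a minimal sketch of how they can be computed from per-configuration accuracies. The exact aggregation used by XModBench (for instance, whether disparity is reported as a standard deviation or as pairwise gaps) is our reading of the description above, and the numbers are placeholders, not benchmark results.

```python
from statistics import mean, stdev

# Per-configuration accuracies for one model on one task (placeholder values).
acc = {"A->T": 0.62, "A->V": 0.55, "T->A": 0.58, "T->V": 0.71, "V->A": 0.54, "V->T": 0.74}

# (1) Task competence: mean accuracy over all six configurations.
task_competence = mean(acc.values())

# (2) Modality disparity: spread across configurations, plus targeted pairwise
#     gaps such as T->A vs. T->V (audio vs. vision under the same text context).
consistency_std = stdev(acc.values())
audio_vs_vision_gap = acc["T->V"] - acc["T->A"]

# (3) Directional imbalance: gap between each direction and its inverse.
directional_gaps = {
    "V<->T": acc["V->T"] - acc["T->V"],
    "A<->T": acc["A->T"] - acc["T->A"],
    "A<->V": acc["A->V"] - acc["V->A"],
}

print(f"competence={task_competence:.3f}  std={consistency_std:.3f}")
print(f"T->V minus T->A = {audio_vs_vision_gap:+.3f}")
print(directional_gaps)
```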
Task Taxonomy
XModBench covers 5 task families with 17 subtasks, spanning perception, spatial reasoning, temporal reasoning, linguistic understanding, and external knowledge. Each task is formulated in the multiple-choice format and follows the modality-balanced configuration.
Task 1: Perception
Evaluates recognition of objects, activities, and scenes across modalities
- General activities
- Fine-grained activities
- Natural environments
- Musical instruments
- Instrument compositions
Task 2: Spatial Reasoning
Tests understanding of object positions and motion in 2D/3D space
- 2D Arrangement
- 3D Localization
- 3D Movement
Task 3: Temporal Reasoning
Assesses comprehension of event order and frequency across time
- Temporal Order
- Temporal Counting
- Temporal Calculation
Task 4: Linguistic Understanding
Unifies OCR and ASR in cross-modal settings and adds affective understanding
- Recognition (OCR/ASR)
- Translation (EN-ZH)
- Emotion Classification
Task 5: External Knowledge
Links multimodal content with world knowledge and cultural understanding
- Music Genre
- Movie Recognition
- Singer Identification
Data Construction Pipeline
The benchmark is built through a rigorous three-stage pipeline:
- Cross-Modal Data Collection: Combining re-annotated datasets (VGG-Sound, STARSS23), synthetic generation (TTS, rendered text), and targeted web collection (YouTube trailers, singer portraits)
- Question Generation: Task-specific templates refined with GPT, semantically challenging distractors, and diversified prompts (a minimal sketch of this stage follows below)
- Quality Assurance: LLM filtering, human verification, and iterative testing to ensure accuracy and eliminate ambiguities
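As a rough illustration of the question-generation stage, the sketch below instantiates a template into a four-choice question. The template, label pool, and make_question helper are hypothetical, and the distractors here are sampled at random, whereas the actual pipeline selects semantically challenging ones with GPT assistance.

```python
import random

# Hypothetical template and label pool; the real pipeline uses task-specific
# templates refined with GPT and semantically challenging (not random) distractors.
TEMPLATE = "Which option matches the sound in the context?"
LABEL_POOL = ["acoustic guitar", "electric guitar", "violin", "cello", "banjo"]

def make_question(answer: str, pool=LABEL_POOL, num_choices: int = 4, seed: int = 0):
    """Build one four-choice question: sample distractors, shuffle, record the answer index."""
    rng = random.Random(seed)
    distractors = rng.sample([x for x in pool if x != answer], num_choices - 1)
    options = distractors + [answer]
    rng.shuffle(options)
    return {"stem": TEMPLATE, "options": options, "answer_index": options.index(answer)}

print(make_question("violin"))
```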
Key Findings
We evaluated 13 state-of-the-art omni-modal models, including Gemini 2.5 Pro, Qwen2.5-Omni, and EchoInk-R1. The results reveal systematic weaknesses across three dimensions:
1. Task Competence Gaps
Models show strong performance on perception and linguistic tasks (best model achieves ~75%), but struggle significantly with spatial and temporal reasoning:
- Gemini 2.5 Pro: 75.9% (Perception), 76.8% (Linguistic), but only 50.1% (Spatial) and 60.8% (Temporal)
- Spatial & Temporal Reasoning: All models drop 15-25 points compared to perception tasks
- Open-source Models: Show even larger gaps, with some scoring below 40% on spatial/temporal tasks
2. Modality Disparity
Performance varies dramatically across modalities, with audio being the most challenging:
- Audio vs. Text: Models drop 20-49 points when audio replaces text inputs
- Audio vs. Vision: 33-point average gap, showing difficulty in aligning heterogeneous signals
- Vision vs. Text: Smaller but still significant ~15-point disparity
- Consistency (Std. Dev.): Best models show 10-12 point standard deviation across configurations
3. Directional Imbalance
Models exhibit asymmetric performance when context and candidate modalities are swapped:
- Vision↔Text: 9-17 point gaps between V→T and T→V directions
- Audio↔Text: 6-8 point asymmetries in bidirectional settings
- Audio↔Vision: Nearly symmetric but with much lower overall accuracy
- Root Cause: Training data imbalance—models heavily trained on image-to-text QA, less on inverse directions
Human Performance
Human evaluation on sampled questions shows consistently high performance across all modalities:
- Overall Average: 91.5% accuracy (vs. 70.6% for best model)
- Perception: 91.0% (vs. 75.9%)
- Spatial: 89.7% (vs. 50.1%)
- Temporal: 88.9% (vs. 60.8%)
- Linguistic: 93.9% (vs. 76.8%)
- Knowledge: 93.9% (vs. 89.3%)
This demonstrates substantial room for improvement, especially in spatial and temporal reasoning, where the human-model gap reaches roughly 28-40 points.
Model-Specific Insights
Gemini 2.5 Pro (Best Overall: 70.6% avg, 11.7 std)
- Strongest across all task families, but still struggles with audio modality
- Relatively balanced performance (lowest std. dev. among top models)
- Excels at external knowledge (89.3%) but weak on spatial reasoning (50.1%)
Qwen2.5-Omni (Most Consistent Open-Source: 58.6% avg, 10.1 std)
- Most consistent open-source model with lowest variance
- Strong linguistic understanding (74.1%) comparable to Gemini
- Significant drop on spatial (38.4%) and temporal (32.3%) tasks
EchoInk-R1 (Best Open-Source Average: 59.2% avg, 11.3 std)
- Competitive with Qwen2.5-Omni, slightly higher variance
- Best open-source performance on temporal reasoning (37.1%)
- Good linguistic capabilities (73.3%) but weaker audio grounding
Key Takeaway: Even the best models fall far short of modality-invariant reasoning, with systematic biases toward text and vision over audio, and asymmetric performance when modality roles are reversed.