Multimodal Inconsistency Reasoning (MMIR)
An illustration of multimodal inconsistency reasoning on a webpage. An agent examines a webpage where the brand “IKEA AB” is mentioned, but other elements clearly refer to “Lorell.” Detecting this brand identity misattribution requires the ability to compare text fields across different sections of the page and reconcile them with accompanying images or context—an inherently multimodal reasoning task.
There are five inconsistency categories in the MMIR benchmark, posing diverse challenges.
Introduction
We introduce MMIR, the first benchmark for evaluating Multimodal Large Language Models (MLLMs) on detecting and reasoning about inconsistencies in layout-rich multimodal content. MMIR features 534 challenging samples across five reasoning-heavy inconsistency categories:
- Factual Contradiction: Direct conflict between two elements (text-text, text-image, or image-image) within the modified content.
- Identity Misattribution: Mislabeling of entities (objects, locations, brands, etc.) that conflicts with other elements.
- Contextual Mismatch: Tonal, thematic, or situational incompatibility between elements.
- Quantitative Discrepancy: Numerical or statistical inconsistencies between elements.
- Temporal/Spatial Incoherence: Implied timelines, dates, or spatial relationships that are impossible or conflicting.
We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts, while open-source models remain particularly vulnerable to inconsistency errors.
Detailed error analyses further show that models excel at detecting inconsistencies confined to a single modality, particularly text, but struggle with cross-modal conflicts and complex layouts. Probing experiments reveal that single-modality prompting, including Chain-of-Thought (CoT) and Set-of-Mark (SoM) methods, yields only marginal gains, exposing a key bottleneck in cross-modal reasoning. Our findings highlight the need for advanced multimodal reasoning and point to future research on multimodal inconsistency.
MMIR Benchmark
Overview
MMIR Statistics. Breakdown of the dataset by artifact category and error type.
MMIR Data filtering process.
The MMIR benchmark was meticulously constructed through a four-stage curation pipeline to ensure high-quality, diverse, and challenging test cases. We began by collecting 521 real-world artifacts — including webpages, presentations, and posters — from trusted sources like VisualWebArena and Zenodo. These artifacts were parsed to extract structured metadata, including element types, content, and spatial layouts.
To simulate realistic errors, we used advanced multimodal models to propose 2,534 synthetic inconsistencies across five predefined categories. These proposals underwent automated validation to ensure technical feasibility and alignment with error definitions. Approved edits were then programmatically applied to artifacts using tools like Chrome DevTools (for webpages) and Python libraries (for presentations).
Finally, human experts rigorously reviewed the modified samples, filtering out unrealistic cases and retaining 534 validated entries that balance complexity and real-world relevance. The resulting dataset spans diverse artifact types and error categories, with carefully designed evaluation prompts for both open-ended and multiple-choice settings.
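For concreteness, the sketch below illustrates the kind of logic behind the injection and automated-validation stages: a proposed edit is checked for basic feasibility and then applied to the parsed element metadata. The `Element`/`EditProposal` schemas, the field names, and the text-only restriction are assumptions for illustration, not the actual pipeline code.

```python
# Hypothetical sketch of the injection + automated-validation steps.
# Schemas and field names are assumed for illustration only.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Element:
    element_id: str
    kind: str        # e.g. "text", "image"
    content: str     # text content or image caption
    bbox: tuple      # (x, y, w, h) layout position

@dataclass(frozen=True)
class EditProposal:
    element_id: str
    category: str    # one of the five inconsistency categories
    new_content: str

CATEGORIES = {
    "factual_contradiction",
    "identity_misattribution",
    "contextual_mismatch",
    "quantitative_discrepancy",
    "temporal_spatial_incoherence",
}

def validate(proposal: EditProposal, elements: dict) -> bool:
    """Automated feasibility check before an edit is applied."""
    target = elements.get(proposal.element_id)
    if target is None or target.kind != "text":
        return False          # this sketch only edits text elements
    if proposal.category not in CATEGORIES:
        return False          # must map to a predefined error type
    return proposal.new_content.strip() != target.content.strip()

def apply_edit(proposal: EditProposal, elements: dict) -> dict:
    """Return a copy of the artifact's elements with the inconsistency injected."""
    edited = dict(elements)
    edited[proposal.element_id] = replace(
        edited[proposal.element_id], content=proposal.new_content
    )
    return edited
```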
Key Features
- 534 carefully validated samples
- Real-world artifacts: Webpages, Slides, Posters
- Synthetic inconsistency injection
- Multi-stage verification pipeline
Evaluation Settings
- Open-ended: Models receive the artifact with a fixed prompt Q_open-ended and generate a free-form response that identifies the semantic mismatch.
- Multiple-choice: Models receive the artifact with a combined prompt Q_MCQ = (Q_open-ended, C_i). Each candidate in C_i is a textual description of an element, and the model must select, from these options, the element(s) corresponding to the introduced inconsistency. A minimal sketch of how the two prompts could be assembled follows this list.
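As a rough illustration of the two settings, the sketch below shows one way the open-ended prompt and the combined MCQ prompt could be assembled; the exact prompt wording, the candidate lettering, and the `build_mcq_prompt` helper are assumptions for illustration, not the benchmark's actual prompts.

```python
# Illustrative prompt construction for the two settings.
# The wording below is assumed, not the benchmark's exact prompts.
Q_OPEN_ENDED = (
    "Examine the attached artifact and identify the element(s) that are "
    "semantically inconsistent with the rest of the content, explaining the conflict."
)

def build_mcq_prompt(candidates: list) -> str:
    """Combine the open-ended prompt with the textual candidate descriptions C_i."""
    options = "\n".join(f"({chr(65 + i)}) {desc}" for i, desc in enumerate(candidates))
    return (
        f"{Q_OPEN_ENDED}\n\n"
        "Choose the inconsistent element(s) from the candidates below:\n"
        f"{options}"
    )

# Example usage with hypothetical candidates (echoing the IKEA/Lorell example above):
print(build_mcq_prompt([
    "Title text: 'Lorell Desk Organizer'",
    "Brand field: 'IKEA AB'",
    "Product image showing a desk organizer",
]))
```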
A qualitative example showing an MMIR test sample under the two evaluation settings, with the ground-truth answer and the responses of the six tested models.
You can download the dataset from the Hugging Face Dataset page.
Experiments and Analysis
The accuracy of six MLLMs under the two evaluation settings. Proprietary models demonstrate higher performance as well as larger performance gains in the MCQ setting. While MCQ-style prompts boost GPT-4o's accuracy by ~15%, open-source models gain minimal benefit, highlighting fundamental reasoning gaps.
| # | Model | Source | Web (Open-ended) | Office (Open-ended) | Poster (Open-ended) | Overall (Open-ended) | Web (MCQ) | Office (MCQ) | Poster (MCQ) | Overall (MCQ) |
|---|-------|--------|------------------|---------------------|---------------------|----------------------|-----------|--------------|--------------|---------------|
| Proprietary Models | | | | | | | | | | |
| 1 | o1 (1217) | Link | 47.91 | 59.19 | 38.73 | 51.40 | 47.91 | 58.52 | 46.47 | 52.15 |
| 2 | GPT-4o (1120) | Link | 25.00 | 42.60 | 30.98 | 33.14 | 37.29 | 58.96 | 47.88 | 47.75 |
| Open-sourced Models | | | | | | | | | | |
| 3 | Qwen2.5-VL-7B | Link | 8.54 | 29.14 | 11.97 | 17.60 | 14.37 | 33.18 | 16.90 | 22.56 |
| 4 | LLaVA-NeXT-7B | Link | 10.20 | 21.97 | 7.04 | 14.70 | 11.45 | 25.33 | 5.63 | 16.47 |
| 5 | InternVL2.5-8B | Link | 7.70 | 24.21 | 4.92 | 14.23 | 9.37 | 23.54 | 11.97 | 15.63 |
| 6 | Phi-3.5-Vision-4B | Link | 6.87 | 24.43 | 7.04 | 14.23 | 1.66 | 8.52 | 0.00 | 4.30 |
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
🚨 For more submission details, please refer to this link.
Fine-grained error analysis
- Performance Gap: Proprietary models excel at detecting factual contradictions and identity misattribution, but even top models like GPT-4o show limitations in resolving temporal/spatial incoherence.
- Modality Matters: Models handle text-text inconsistencies best but falter with image-image comparisons, exposing weaknesses in visual reasoning.
- Layout Complexity: Performance drops sharply as artifacts become visually dense; models lose up to 40% accuracy on cluttered layouts compared to simple ones (one illustrative way to bucket artifacts by density is sketched after the figures below).
Fine-grained analysis of model performance across Inconsistency Categories and Modalities.
Model performance on layout complexity.
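As an illustrative aside on the layout-complexity analysis, the sketch below buckets artifacts by parsed element count as a stand-in for visual density and computes per-bucket accuracy; the thresholds and the element-count proxy are assumptions, not the paper's exact complexity measure.

```python
# Hypothetical sketch: element count as a proxy for layout complexity
# (thresholds are assumed, not the paper's exact definition).
from collections import defaultdict

def complexity_bucket(num_elements: int) -> str:
    if num_elements <= 10:
        return "simple"
    if num_elements <= 25:
        return "medium"
    return "complex"

def accuracy_by_complexity(samples: list) -> dict:
    """samples: iterable of dicts like {'num_elements': 12, 'correct': True}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for s in samples:
        bucket = complexity_bucket(s["num_elements"])
        totals[bucket] += 1
        hits[bucket] += int(s["correct"])
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```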
Prompting Strategies Analysis
To enhance multimodal inconsistency reasoning, we tested three prompting approaches and made the following observations:
- Chain-of-Thought (CoT): Explicit textual reasoning steps provided minimal benefit and sometimes reduced accuracy, especially for open-source models.
- Set-of-Mark (SoM): Visual bounding boxes improved GPT-4o’s performance (+5%) but confused other models, often degrading results.
- Multimodal Interleaved CoT (MM-CoT): Our novel two-stage method combines textual reasoning with iterative visual refinement.
Probing results of different prompting methods. Performance of each prompting method is directly compared with the vanilla setting. Gains are in blue and drops are in red.
MM-CoT outperformed all other methods, boosting GPT-4o's accuracy by 4.4% and showing modest gains for open-source models. Proprietary models benefited most from iterative cross-modal integration, while isolated prompts (CoT/SoM) proved ineffective. Visual annotations only helped when guided by initial textual reasoning, highlighting the need for tightly coupled multimodal interaction.
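To make the two-stage idea concrete, here is a minimal sketch of how a textual-reasoning pass followed by a visually marked refinement pass could be wired together; `call_mllm`, `draw_marks`, and the prompt wording are hypothetical placeholders rather than the method's actual implementation.

```python
# Hypothetical sketch of a two-stage multimodal interleaved CoT query.
# `call_mllm(prompt, image)` and `draw_marks(image, elements)` are placeholder
# callables supplied by the caller; they are not part of any real API here.
from typing import Callable, List

def mm_cot(
    image: bytes,
    question: str,
    call_mllm: Callable[[str, bytes], str],
    draw_marks: Callable[[bytes, List[str]], bytes],
) -> str:
    # Stage 1: textual chain-of-thought over the unmodified artifact image.
    stage1 = call_mllm(
        f"{question}\nThink step by step and list, one per line starting with '-', "
        "the elements that look inconsistent.",
        image,
    )
    suspects = [line.lstrip("- ").strip() for line in stage1.splitlines()
                if line.strip().startswith("-")]

    # Stage 2: re-query with the suspected elements visually marked, so the
    # model can refine its answer with explicit cross-modal grounding.
    marked = draw_marks(image, suspects)
    return call_mllm(
        f"{question}\nThe marked regions were flagged in a first pass: {suspects}. "
        "Verify them against the full layout and state the final inconsistent element(s).",
        marked,
    )
```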
BibTeX
@misc{yan2025multimodalinconsistencyreasoningmmir,
title={Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models},
author={Qianqi Yan and Yue Fan and Hongquan Li and Shan Jiang and Yang Zhao and Xinze Guan and Ching-Chen Kuo and Xin Eric Wang},
year={2025},
eprint={2502.16033},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.16033},
}