MORSE-500
A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
Example Videos
Key Features
MORSE-500 addresses critical limitations in existing multimodal reasoning benchmarks through several key innovations that push beyond static image analysis into dynamic video understanding:
500 newly created video clips with CSV metadata, packaged to load quickly and stream efficiently
Videos are generated programmatically so we can dial up complexity and release harder versions as models improve
Spanning Abstract, Mathematical, Physical, Planning, Spatial, Temporal (+ Causal) – a vibrant mix of the reasoning types that matter
Video-based tasks requiring understanding of dynamic sequences, causal chains, and temporal relationships that unfold over time – something static images simply cannot capture
Questions are baked right into the videos. No text crutches, no shortcuts – if you can't see it, you can't solve it
A "-view" subset streams directly on Hugging Face, which makes browsing and debugging much easier
Reasoning Categories
(proportion of data %)
Abstract: Pattern recognition, logical inference, symbolic reasoning
Mathematical: Arithmetic operations, algebraic relations, quantitative comparisons
Physical: Object dynamics, causal interactions, physics laws
Planning: Multi-step reasoning, goal-directed problem solving
Spatial: Object relationships, spatial transformations, 3D reasoning
Temporal: Sequence understanding, causal inference over time
Leaderboard
| Rank | Model | Model Type | Date | ALL (%) | Abstract (%) | Math (%) | Physical (%) | Planning (%) | Spatial (%) | Temporal (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| - | Human | Human 👤 | 2025-05-28 | 55.4 | 37.5 | 45.5 | 56.3 | 56.0 | 73.1 | 55.2 |
| 1 | o3 🥇 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 23.6 | 23.4 | 27.4 | 28.1 | 5.0 | 29.6 | 31.2 |
| 2 | o4-mini 🥈 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 22.2 | 21.9 | 23.8 | 29.7 | 5.0 | 27.8 | 28.7 |
| 3 | Gemini 2.5 Pro 🥉 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 21.8 | 18.8 | 36.9 | 29.7 | 3.0 | 16.7 | 32.5 |
| 4 | o1 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 19.8 | 17.2 | 22.6 | 28.1 | 5.0 | 23.1 | 26.2 |
| 5 | Gemini 2.5 Flash | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 19.2 | 9.4 | 35.7 | 28.1 | 1.0 | 24.1 | 18.8 |
| 6 | Gemini 1.5 Pro | VLM 🎬 | 2025-05-28 | 18.8 | 12.5 | 21.4 | 26.6 | 1.0 | 26.9 | 26.2 |
| 7 | Qwen2.5 VL 72B | VLM 🎬 | 2025-05-28 | 17.8 | 6.2 | 21.4 | 34.4 | 1.0 | 22.2 | 25.0 |
| 8 | GPT 4o | Unified Model 🎭 | 2025-05-28 | 17.4 | 17.2 | 20.2 | 34.4 | 4.0 | 12.0 | 25.0 |
| 9 | Qwen2.5 VL 32B AWQ | VLM 🎬 | 2025-05-28 | 16.8 | 14.1 | 23.8 | 34.4 | 1.0 | 15.7 | 18.8 |
| 10 | Qwen2.5 VL 72B AWQ | VLM 🎬 | 2025-05-28 | 16.4 | 12.5 | 11.9 | 29.7 | 2.0 | 27.8 | 16.2 |
| 11 | Gemini 2.0 Flash | VLM 🎬 | 2025-05-28 | 16.0 | 12.5 | 29.8 | 28.1 | 0.0 | 13.0 | 18.8 |
| 12 | Qwen2.5 VL 32B | VLM 🎬 | 2025-05-28 | 15.6 | 9.4 | 19.0 | 29.7 | 2.0 | 16.7 | 21.2 |
| 13 | Gemma 3 27b | VLM 🖼️ | 2025-05-28 | 14.6 | 20.3 | 20.2 | 25.0 | 1.0 | 13.0 | 15.0 |
| 14 | Gemini 2.0 Flash-Lite | VLM 🎬 | 2025-05-28 | 14.2 | 17.2 | 21.4 | 21.9 | 2.0 | 14.8 | 12.5 |
| 15 | MiniCPM-o 2.6 | VLM 🎬 | 2025-05-28 | 11.6 | 4.7 | 10.7 | 23.4 | 1.0 | 16.7 | 15.0 |
| 16 | Qwen2.5 Omni 7B | LMM 🎬🎵 | 2025-05-28 | 11.4 | 6.2 | 9.5 | 21.9 | 2.0 | 15.7 | 15.0 |
| 17 | Qwen2.5 VL 7B | VLM 🎬 | 2025-05-28 | 11.2 | 7.8 | 11.9 | 25.0 | 2.0 | 12.0 | 12.5 |
| 18 | InternVL3 8B | VLM 🖼️ | 2025-05-28 | 7.8 | 6.2 | 6.0 | 14.1 | 1.0 | 11.1 | 10.0 |
| 19 | Qwen2.5 VL 3B | VLM 🎬 | 2025-05-28 | 7.6 | 9.4 | 3.6 | 18.8 | 1.0 | 9.3 | 7.5 |
| 20 | LLaVA-NeXT-Video 7B | VLM 🎬 | 2025-05-28 | 5.0 | 1.6 | 11.9 | 6.2 | 0.0 | 5.6 | 5.0 |
Model Types: 💭 Reasoning • 🖼️ Image • 🎬 Video • 🎵 Audio • 🎭 Unified (Visual Understanding + Generation)
🎯 To submit your results to the leaderboard, please complete this form.
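If ALL is the accuracy over all 500 questions, it should equal the size-weighted mean of the six category columns. The sketch below checks this for the o3 row. The per-category question counts are not stated on this page: the values in `ASSUMED_SIZES` are an assumption inferred from the granularity of the reported percentages, chosen because they sum to 500. The function name `overall_accuracy` is likewise illustrative.

```python
# Assumed per-category question counts. These are NOT given in the source;
# they are inferred from the reported percentages and sum to 500.
ASSUMED_SIZES = {
    "Abstract": 64,
    "Math": 84,
    "Physical": 64,
    "Planning": 100,
    "Spatial": 108,
    "Temporal": 80,
}


def overall_accuracy(category_acc, sizes=ASSUMED_SIZES):
    """Size-weighted mean of per-category accuracies (all values in percent)."""
    total = sum(sizes.values())
    weighted = sum(category_acc[name] * sizes[name] for name in sizes)
    return weighted / total


# Category accuracies for o3, copied from the leaderboard row above.
o3 = {
    "Abstract": 23.4,
    "Math": 27.4,
    "Physical": 28.1,
    "Planning": 5.0,
    "Spatial": 29.6,
    "Temporal": 31.2,
}

# Compare the result with the ALL value reported for o3 in the table.
print(round(overall_accuracy(o3), 1))
```

Under these assumed sizes, Planning carries the largest weight after Spatial, so the very low Planning scores pull every model's overall accuracy down noticeably.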
Difficulty Scaling
One of MORSE-500's key innovations is its ability to systematically scale difficulty through programmatic control. The examples below demonstrate how task complexity can be increased while maintaining the core reasoning category.
To illustrate our scaling approach, consider the frozen lake environment: we can incrementally increase difficulty by expanding the maze size, adding more action options, introducing fog effects, or reducing the agent's visible range. Similarly, other tasks in our benchmark can be scaled by manipulating sequence length and other relevant parameters.
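A minimal sketch of what such programmatic control could look like for a frozen-lake style maze. Everything here is hypothetical: the `FrozenLakeDifficulty` fields, `make_grid`, and `mask_grid` are illustrative names, not the benchmark's actual generator. It covers only two of the knobs mentioned above, maze size and visible range (rendered as fog), plus an assumed hole density.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrozenLakeDifficulty:
    """Hypothetical difficulty knobs (illustrative, not from MORSE-500)."""
    grid_size: int = 4                  # side length of the square maze
    hole_fraction: float = 0.2          # assumed share of cells turned into holes
    view_radius: Optional[int] = None   # None = no fog; smaller radius = harder


def make_grid(cfg, seed=0):
    """Build a grid of S (start), G (goal), F (frozen), H (hole) cells."""
    rng = random.Random(seed)
    n = cfg.grid_size
    grid = [["F"] * n for _ in range(n)]
    # Reserve one random right/down path so the maze is always solvable.
    r = c = 0
    safe = {(0, 0)}
    while (r, c) != (n - 1, n - 1):
        if r == n - 1:
            c += 1
        elif c == n - 1:
            r += 1
        elif rng.random() < 0.5:
            r += 1
        else:
            c += 1
        safe.add((r, c))
    # Place holes only on cells that are off the reserved path.
    candidates = [(i, j) for i in range(n) for j in range(n) if (i, j) not in safe]
    n_holes = min(len(candidates), int(cfg.hole_fraction * n * n))
    for i, j in rng.sample(candidates, n_holes):
        grid[i][j] = "H"
    grid[0][0], grid[n - 1][n - 1] = "S", "G"
    return grid


def mask_grid(grid, cfg, agent):
    """Replace cells outside the agent's visible range with '?' (fog)."""
    if cfg.view_radius is None:
        return [row[:] for row in grid]
    ar, ac = agent
    return [
        [cell if max(abs(i - ar), abs(j - ac)) <= cfg.view_radius else "?"
         for j, cell in enumerate(row)]
        for i, row in enumerate(grid)
    ]


easy = FrozenLakeDifficulty(grid_size=4)
hard = FrozenLakeDifficulty(grid_size=8, hole_fraction=0.3, view_radius=1)
for cfg in (easy, hard):
    foggy = mask_grid(make_grid(cfg, seed=1), cfg, agent=(0, 0))
    print("\n".join("".join(row) for row in foggy), end="\n\n")
```

Keeping the knobs in a single immutable config object means a harder version of the same task is just a new parameter setting with the same seed and generator, which is the property the scaling argument relies on.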
BibTeX
@article{cai2025morse500,
title={MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning},
author={Cai, Zikui and Wang, Andrew and Satheesh, Anirudh and Nakhawa, Ankit and Jae, Hyunwoo and Powell, Keenan and Liu, Minghui and Jay, Neel and Oh, Sungbin and Wang, Xiyao and Liang, Yongyuan and Goldstein, Tom and Huang, Furong},
journal={arXiv preprint arXiv:2506.05523},
year={2025}
}