We propose Visual Game Learning (ViGaL), a novel post-training paradigm in which MLLMs develop out-of-domain generalization of multimodal reasoning by playing arcade-style games. Specifically, we show that post-training a 7B-parameter MLLM with reinforcement learning (RL) on simple games such as Snake and a rotation puzzle significantly enhances its downstream performance on multimodal reasoning benchmarks such as MathVista, MathVerse, and MathVision, without the model seeing any worked solutions, equations, or diagrams during RL. Remarkably, the resulting model surpasses large-scale proprietary models and models tuned directly on visual math datasets. Ablation studies indicate that distinct games unlock complementary reasoning skills, leading to improved generalization when combined. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable, scalable pretext tasks that effectively unlock generalizable multimodal reasoning abilities in MLLMs.
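A key property of such games is that the reward can be computed purely from game rules, with no human annotation. Below is a minimal sketch of a rule-based reward for a Snake-style move, illustrating the kind of verifiable signal the games provide; the function name, board encoding, and reward values are illustrative, not the paper's actual environment:

```python
# Minimal sketch of a rule-based reward for a Snake-style move, illustrating
# how gameplay yields verifiable RL signals without any math annotations.
# The board encoding and reward values are illustrative, not the paper's
# actual environment.

def snake_move_reward(board, head, food, move):
    """Score a proposed move on a grid of "empty"/"body" cells.

    board: list of rows, board[y][x] in {"empty", "body"}
    head, food: (x, y) positions; move: "up"/"down"/"left"/"right"
    """
    dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[move]
    x, y = head[0] + dx, head[1] + dy
    height, width = len(board), len(board[0])
    if not (0 <= x < width and 0 <= y < height) or board[y][x] == "body":
        return -1.0  # illegal: runs into a wall or the snake's own body
    if (x, y) == food:
        return 1.0   # optimal: the move reaches the food
    return 0.0       # legal but neutral
```

During RL, the model sees a rendered board and proposes a move in text; a verifier like this turns the move into a scalar reward, so no worked solutions are ever needed.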
```bash
git clone https://github.com/yunfeixie233/ViGaL.git
cd ViGaL
pip install -e .[vllm]
pip install flash_attn --no-build-isolation
```

Please see ViGaL Weights for the released model checkpoints.
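Once downloaded, the checkpoint should load like any Qwen2.5-VL derivative (ViGaL is post-trained on Qwen2.5-VL-7B). A minimal inference sketch follows, assuming a Hugging Face Hub checkpoint and `qwen-vl-utils`; the repo id is a placeholder, substitute the actual ViGaL Weights location:

```python
# Minimal inference sketch. Assumptions: the checkpoint is a Qwen2.5-VL
# derivative hosted on the Hugging Face Hub, and qwen-vl-utils is installed
# (pip install qwen-vl-utils). The repo id below is a placeholder.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "ViGaL-7B"  # placeholder: replace with the actual ViGaL Weights repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "math_problem.png"},  # any local image path
        {"type": "text", "text": "Solve the problem in the image step by step."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```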
You can download our training data from ViGaL training data (coming soon).
- For the Snake game:

  ```bash
  sh examples/scripts/train_snake.sh
  ```

- For the Rotation game:

  ```bash
  sh examples/scripts/train_rotation.sh
  ```

- For both Snake and Rotation games:

  ```bash
  sh examples/scripts/train_snake_rotation.sh
  ```
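The scripts above train on procedurally generated game episodes. As a rough illustration of how such rule-based tasks can be synthesized, a rotation-puzzle sample might be generated as sketched below; the text format and helper names are assumptions for illustration, not the repo's actual data pipeline:

```python
# Illustrative sketch of procedural rotation-puzzle generation. The sample
# format is an assumption, not the repo's actual data pipeline.
import random

def rot90_cw(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def make_rotation_sample(size=3, angles=(90, 180, 270)):
    """Create one text-form rotation-puzzle sample with an exact-match answer."""
    while True:
        grid = [[random.randint(0, 3) for _ in range(size)] for _ in range(size)]
        angle = random.choice(angles)
        rotated = grid
        for _ in range(angle // 90):
            rotated = rot90_cw(rotated)
        # reject symmetric grids where another angle (or no rotation) gives the
        # same result, which would make the answer ambiguous
        others = [grid]
        for a in angles:
            if a == angle:
                continue
            g = grid
            for _ in range(a // 90):
                g = rot90_cw(g)
            others.append(g)
        if rotated not in others:
            return {
                "question": f"Grid A: {grid}\nGrid B: {rotated}\n"
                            "By how many degrees clockwise was A rotated to get B?",
                "answer": str(angle),  # exact-match reward target
            }
```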
- For MathVista, MathVision, and MathVerse: we use the evaluation code in the `eval/` directory.
- For CLEVR+ and Geometry: please follow the evaluation protocol of Reason-RFT.
- For the MMMU validation set: please follow the evaluation protocol of Qwen2.5-VL.
- For other general visual benchmarks: please follow the evaluation protocol of VLMEvalKit.
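For intuition about what these evaluations compute, math benchmarks typically reduce to extracting a final answer from the model response and exact-matching it against the reference. A minimal, hypothetical scoring sketch follows; the `\boxed{}` convention and normalization here are assumptions, and the real logic lives in `eval/` and the toolkits above:

```python
# Hypothetical answer-matching sketch for math benchmarks such as MathVista.
# The \boxed{} convention is an assumption about the response format.
import re

def extract_final_answer(response: str) -> str:
    """Pull the final answer out of a model response, preferring \\boxed{...}."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    # fall back to the last non-empty line of the response
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def accuracy(predictions, references):
    """Exact-match accuracy after normalizing case and surrounding whitespace."""
    hits = sum(extract_final_answer(p).lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / max(len(references), 1)
```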
We evaluate ViGaL, trained only on games, on out-of-domain tasks that demand reasoning: mathematics, 3D understanding in CLEVR+, geometric problem solving, and multi-discipline questions from the MMMU series. Here are our findings:
- **Zero-shot generalization from gameplay to math reasoning and beyond.** ViGaL outperforms models specifically fine-tuned with RL on mathematical, spatial, and multi-discipline reasoning tasks, showing remarkable generalization despite having no exposure to in-domain training data during RL post-training.
- **Blending both games leads to better generalization.** Visual Game Learning shows promise as a training paradigm that enhances generalizable reasoning without requiring extensive collections of domain-specific training data. Simply expanding the diversity of games during training yields consistent performance gains across visual-reasoning problems.
- **Preserving general visual capabilities while enhancing reasoning.** Experiments on more general and comprehensive multimodal benchmarks show that our gameplay-based approach enables math generalization without compromising other visual abilities.
| Model | Avg. | Math Avg. | MathVista | MathVerse | MathVision | Geometry Avg. | GeoMath | Geo3K | CLEVR+ Avg. | CLEVR-M | S-CLEVR | Multi-Disc. Avg. | MMMU (val) | MMMU-Pro (overall) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Proprietary Model* | | | | | | | | | | | | | | |
| GPT-4o | 51.7 | 48.1 | 61.4 | 50.2 | 30.4 | 46.8 | 50.2 | 43.5 | 51.2 | 68.1 | 34.3 | 60.5 | 69.1 | 51.9 |
| Gemini-2.0-Flash | - | 56.4 | 73.4 | 54.6 | 41.3 | 54.4 | 55.3 | 53.5 | 46.3 | 64.9 | 27.6 | - | 71.9 | - |
| *General Multimodal Language Model* | | | | | | | | | | | | | | |
| InternVL2.5-8B | 51.5 | 41.2 | 64.4 | 39.5 | 19.7 | 55.2 | 63.0 | 47.3 | 64.4 | 93.5 | 35.3 | 45.2 | 56.0 | 34.3 |
| LLaVA-OV-7B | - | - | 63.2 | 26.2 | - | 60.7 | 77.6 | 43.7 | 49.4 | 69.7 | 29.1 | 36.5 | 48.8 | 24.1 |
| Qwen2.5-VL-7B | 48.3 | 47.7 | 68.0 | 49.0 | 26.0 | 44.8 | 44.0 | 45.6 | 54.9 | 74.6 | 35.2 | 45.7 | 54.3 | 37.0 |
| *Multimodal Reasoning Model Post-Trained on Qwen2.5-VL-7B* | | | | | | | | | | | | | | |
| R1-Onevision-7B | 47.3 | 46.8 | 64.1 | 46.4 | 29.9 | 35.0 | 45.4 | 24.5 | 65.1 | 75.5 | 54.7 | 42.3 | 51.9 | 32.6 |
| R1-VL-7B | 47.3 | 42.7 | 63.5 | 40.0 | 24.7 | 39.0 | 42.0 | 36.1 | 68.0 | 87.4 | 48.6 | 39.7 | 50.0 | 29.4 |
| MM-Eureka-Qwen-7B | 51.1 | 50.1 | 73.0 | 50.3 | 26.9 | 28.4 | 53.1 | 3.8 | 79.3 | 98.4 | 60.1 | 46.4 | 55.8 | 36.9 |
| Reason-RFT-Zero-7B | 52.5 | 38.1 | 60.7 | 35.3 | 18.3 | 54.9 | 55.0 | 54.8 | 76.2 | 99.4 | 53.0 | 40.9 | 51.2 | 30.6 |
| VLAA-Thinker-7B | 56.5 | 48.7 | 68.0 | 51.7 | 26.4 | 53.9 | 51.1 | 56.6 | **83.4** | 94.7 | 72.1 | 40.1 | 48.2 | 31.9 |
| OpenVLThinker-7B | 56.3 | 47.8 | 70.2 | 47.9 | 25.3 | 56.4 | 49.2 | 63.5 | 82.4 | 93.8 | 71.0 | 38.5 | 54.8 | 22.1 |
| ViGaL Snake | 58.3 | 49.4 | 70.7 | 51.1 | 26.5 | 55.0 | 49.9 | 60.0 | 82.6 | 92.6 | 72.6 | 46.2 | 55.8 | 36.6 |
| ViGaL Rotation | 58.4 | 49.3 | 71.2 | 50.4 | 26.3 | **57.9** | 51.7 | 64.1 | 80.7 | 93.0 | 68.3 | 45.9 | 54.1 | 37.7 |
| ViGaL Snake + Rotation | **59.3** | **50.6** | 71.9 | 52.4 | 27.5 | 57.1 | 51.0 | 63.3 | 81.7 | 91.9 | 71.4 | **47.7** | 58.0 | 37.4 |
Main results on multimodal reasoning benchmarks. We primarily compare with multimodal reasoning models post-trained on math data on top of Qwen2.5-VL-7B. CLEVR-M denotes CLEVR-Math, and S-CLEVR stands for Super-CLEVR. Results from reasoning models post-trained with corresponding in-domain data are de-emphasized, while our ViGaL models remain exclusively post-trained on visual games. Best scores of post-trained models in each "Avg." column are highlighted in bold.
| Model | Avg. | General Avg. | MuirBench | CRPE (rel.) | Vision-Centric Avg. | MMVP | RealWorldQA | MMStar | BLINK (val) | MME (p) | OCR & Chart Avg. | AI2D (w. M.) | SEED-Bench-2+ | DocVQA (val) | OCRBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Proprietary Model* | | | | | | | | | | | | | | | |
| GPT-4o | 74.8 | 72.3 | 68.0 | 76.6 | 69.4 | - | 75.4 | 64.7 | 68.0 | 1614 | 82.6 | 84.6 | 72.0 | 91.1 | 736 |
| *General Multimodal Language Model* | | | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 72.4 | 68.0 | 59.6 | 76.4 | 65.8 | 74.3 | 68.5 | 63.9 | 56.4 | 1698 | 83.3 | 83.9 | 70.4 | 95.7 | 864 |
| *Multimodal Reasoning Model Post-Trained on Qwen2.5-VL-7B* | | | | | | | | | | | | | | | |
| R1-Onevision-7B | - | 66.8 | 46.3 | 87.3 | 56.5 | 61.3 | 58.0 | 57.8 | 48.7 | 1504 | - | - | - | - | - |
| R1-VL-7B | 67.4 | 63.3 | 54.1 | 72.4 | 59.6 | 70.3 | 61.4 | 55.6 | 51.0 | 1657 | 79.2 | 81.7 | 66.4 | 89.4 | 81.0 |
| MM-Eureka-Qwen-7B | 71.8 | **68.9** | 61.1 | 76.7 | 65.1 | 74.3 | 66.1 | 65.9 | 54.0 | 1626 | 81.5 | 84.3 | 68.2 | 92.0 | 87.0 |
| Reason-RFT-Zero-7B | 68.4 | 66.9 | 58.5 | 75.2 | 58.5 | 58.0 | 65.3 | 59.1 | 51.6 | 1653 | 79.8 | 83.3 | 68.0 | 88.1 | 82.0 |
| VLAA-Thinker-7B | 69.7 | 65.9 | 57.1 | 74.6 | 62.6 | 71.6 | 65.4 | 60.4 | 53.0 | 1593 | 80.6 | 83.4 | 67.4 | 90.9 | 84.5 |
| OpenVLThinker-7B | - | 64.3 | 52.8 | 75.8 | 50.4 | 32.3 | 60.2 | 59.1 | 49.9 | 1513 | - | - | - | - | - |
| ViGaL Snake + Rotation | **72.2** | 68.6 | 60.5 | 76.7 | **65.7** | 74.6 | 67.3 | 65.4 | 55.6 | 1685 | **82.2** | 84.8 | 69.1 | 92.7 | 86.6 |
Main results on multimodal benchmarks targeting more general and comprehensive visual abilities. We compare with models post-trained on Qwen2.5-VL-7B. Best category averages among post-trained models are highlighted in bold. Note that MME (p) is excluded from the vision-centric category average because its score scale differs from the other benchmarks.
If you find ViGaL useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{xie2025play,
  title   = {Play to Generalize: Learning to Reason Through Game Play},
  author  = {Xie, Yunfei and Ma, Yinsong and Lan, Shiyi and Yuille, Alan and Xiao, Junfei and Wei, Chen},
  journal = {arXiv preprint arXiv:2506.08011},
  year    = {2025},
}
```

- MM-EUREKA: we build our codebase on MM-EUREKA.
