MCU-turbo: A Standard Benchmark for Evaluating Minecraft Agents
MCU-turbo is a standard benchmark built on the MCU framework, which features over 3,000 atomic tasks. It is designed as a standardized test, selecting 80 atomic tasks across 10 categories together with 20 compositional tasks. Each task is evaluated under two difficulty levels, Simple and Hard, to rigorously test agent generalization, tool use, planning, and robustness under environmental variations.
Simple mode: Tasks begin in a clear environment with the necessary resources pre-supplied.
Hard mode: Agents face limited resources and disruptive factors such as poor visibility (e.g., bad weather, night-time) and extra distractors (e.g., swarms of mobs, scattered items).
Dual Difficulty: Each task runs in both simple and hard versions to evaluate intra-task generalization.
Agent-Agnostic: Compatible with MineStudio agents or any API-based Minecraft wrapper.
VLM-based Evaluation: A vision-language model analyzes video trajectories using multi-dimensional criteria.
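The dual-difficulty design can be read as a per-task comparison between the two modes. The sketch below is illustrative only: the score dictionary, the task names, and the numeric values are hypothetical placeholders, not MCU-turbo output or its actual result format. It computes the simple-minus-hard score gap for each task, where a smaller gap suggests better intra-task generalization.

```python
from statistics import mean

# Hypothetical per-task scores, made up for illustration only.
# The real evaluation output format may differ.
scores = {
    "task_a": {"simple": 8.0, "hard": 6.0},
    "task_b": {"simple": 5.0, "hard": 5.0},
}


def generalization_gap(task_scores):
    """Return the simple-minus-hard score difference for each task."""
    return {
        task: modes["simple"] - modes["hard"]
        for task, modes in task_scores.items()
    }


gaps = generalization_gap(scores)
mean_gap = mean(gaps.values())

for task, gap in gaps.items():
    print(f"{task}: gap = {gap:.1f}")
print(f"mean gap = {mean_gap:.1f}")
```

Reporting the gap alongside the raw scores keeps an agent that only does well in simple mode from looking as strong as one that holds up under hard-mode conditions.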
Task Overview
Below is a curated subset of tasks from the full set of 80, organized by category. Tasks marked with both difficulty icons are available in both simple and hard modes.
All tasks include executable task configs in /MCU/MCU_benchmark/task_configs.
The analysis of our baseline results can be found in /MCU/docs/baseline.md.
cd MCU_benchmark
python run_task.py \
  --difficulty simple
Evaluation videos are automatically saved in output/.
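To cover both difficulty levels in one pass, the command above can be wrapped in a small driver script. This is a minimal sketch, assuming `run_task.py` accepts `hard` as a `--difficulty` value in the same way as `simple` (only `simple` is shown above) and that it is launched from the `MCU_benchmark` directory; `build_command` and `run_all` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys

# "simple" appears in the command above; "hard" as a flag value is an assumption.
DIFFICULTIES = ("simple", "hard")


def build_command(difficulty: str) -> list[str]:
    """Build the argument list for one benchmark run."""
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    return [sys.executable, "run_task.py", "--difficulty", difficulty]


def run_all(dry_run: bool = True, workdir: str = "MCU_benchmark") -> list[list[str]]:
    """Run (or, in dry-run mode, only collect) one command per difficulty."""
    commands = [build_command(d) for d in DIFFICULTIES]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing run.
            subprocess.run(cmd, check=True, cwd=workdir)
    return commands


# Dry run: print the commands without launching Minecraft.
for cmd in run_all(dry_run=True):
    print(" ".join(cmd))
```

Calling `run_all(dry_run=False)` from the repository root would execute both runs in sequence.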
VLM evaluation:
cd auto_eval
python batch_video_rating.py \
  --videos_path='./output/' \
  --criteria_files_path='./auto_eval/criteria_files/'
Reference
Please consider citing the following paper:
@inproceedings{zheng2025mcu,
title = {MCU: An Evaluation Framework for Open-Ended Game Agents},
author = {Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Zheng, Zilong and Liang, Yitao},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
year = {2025},
url = {https://arxiv.org/abs/2310.08367}
}
Contribute
You can contribute new tasks or difficulty configurations. Submit PRs or open issues to discuss!