EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
📄 Paper | 🤗 Dataset | 🏠 Project Website
Rui Yang*, Hanyang Chen*, Junyu Zhang*, Mark Zhao*, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
University of Illinois Urbana-Champaign, Northwestern University, University of Toronto, Toyota Technological Institute at Chicago
We introduce EmbodiedBench, a comprehensive benchmark designed to evaluate Multi-modal Large Language Models (MLLMs) as embodied agents. While existing benchmarks have primarily focused on Large Language Models (LLMs) and high-level tasks, EmbodiedBench takes a leap forward by offering a comprehensive, fine-grained evaluation of MLLM-based agents across both high-level and low-level tasks, as well as six critical agent capabilities.
EmbodiedBench is more than a benchmark—it’s a multifaceted, standardized evaluation platform that not only uncovers the current challenges in embodied AI but also provides actionable insights to push the boundaries of MLLM-driven embodied agents.
- 2025.10 We released Embodied Reasoning Agent (ERA), a training recipe for VLM-based embodied agents with enhanced reasoning and grounding capability. Explore more on our project page!
- 2025.06.03 We released a large collection of trajectory datasets generated by a diverse set of models, including both closed-source and open-source models. Feel free to use them to train better embodied agents!
- 2025.05.01 EmbodiedBench is accepted to ICML 2025!
- 2025.03.19 We added support for several recent MLLMs, including `microsoft/Phi-4-multimodal-instruct`, `AIDC-AI/Ovis2-16B`, `AIDC-AI/Ovis2-34B`, and `google/gemma-3-12b-it`, and fixed some common JSON generation errors.
- 🛠️ **Diverse Tasks with Hierarchical Action Levels:** 1,128 testing tasks across four environments, spanning from high-level tasks (EB-ALFRED and EB-Habitat) to low-level tasks (EB-Navigation and EB-Manipulation). We created new high-quality datasets and enhanced existing simulators to support comprehensive assessments.
- 🎯 **Capability-Oriented Evaluation:** Six specialized subsets to evaluate essential agent capabilities, including commonsense reasoning, complex instruction, spatial awareness, visual perception, and long-term planning.
- ⚡ **Unified APIs for Embodied Environments:** EmbodiedBench provides Gym-style APIs for all environments, ensuring ease of use and seamless agent evaluation (see the usage sketch after this list).
- 🏹 **Effortless MLLM/LLM Evaluation (API & Local Support):**
  - Supports proprietary models (e.g., the OpenAI API) and open-source models (local execution).
  - Enables self-hosted model evaluation using OpenAI API-style calls or offline execution based on LMDeploy.
  - While mainly focused on MLLMs, EmbodiedBench also supports LLM evaluation.
- 🔧 **Configurable Textual and Visual Designs:** Our flexible configuration options enable in-depth experimentation with visual input, textual and visual in-context prompts, environment feedback, camera resolution, detection boxes, multi-step/multi-view image inputs, and more, empowering researchers to better understand the role of each component in agent performance.
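To make the Gym-style interface above concrete, here is a minimal usage sketch. It is illustrative only: the import path, constructor arguments, observation format, and action representation are assumptions and may differ from the actual environment classes in this repo (the real entry points appear in the environment validation commands later in this README).

```python
# Minimal sketch of the Gym-style interaction loop (illustrative only; the
# import path, constructor, observation keys, and action format are assumed).
import random

from embodiedbench.envs.eb_alfred.EBAlfEnv import EBAlfEnv  # assumed import path

def random_policy(observation, num_actions=10):
    """Hypothetical stand-in for an MLLM agent: pick an arbitrary action id."""
    return random.randrange(num_actions)

env = EBAlfEnv()                                  # assumed default construction
obs = env.reset()                                 # Gym-style reset
done = False
while not done:
    action = random_policy(obs)                   # replace with your MLLM agent
    obs, reward, done, info = env.step(action)    # Gym-style step
env.close()
```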
"Fine-grained" indicates a multi-dimensional evaluation approach rather than an overall accuracy.
¹AgentBench and VisualAgentBench include domains such as household, games, and web.
²VLABench is originally used for evaluating VLA models.
| Benchmark | Category | Action Level | #Env. | #Test Tasks | Multimodal | Fine-grained | LLM/VLM Support |
|---|---|---|---|---|---|---|---|
| ALFRED | Household | High | 1 | 3062 | ✅ | ❌ | ❌ |
| VLMbench | Manipulation | Low | 1 | 4760 | ✅ | ❌ | ❌ |
| Language Rearrangement | Household | High | 1 | 1000 | ✅ | ✅ | ❌ |
| GOAT-bench | Navigation | Low | 1 | 3919 | ✅ | ❌ | ❌ |
| AgentBench | Multi-domain¹ | High | 8 | 1091 | ❌ | ❌ | ✅ |
| Lota-bench | Household | High | 2 | 308 | ❌ | ❌ | ✅ |
| VisualAgentBench | Multi-domain¹ | High | 5 | 746 | ✅ | ❌ | ✅ |
| Embodied Agent Interface | Household | High | 2 | 438 | ❌ | ✅ | ✅ |
| VLABench | Manipulation | Low² | 1 | 100 | ✅ | ✅ | ✅ |
| EmbodiedBench (ours) | Multi-domain | High & Low | 4 | 1128 | ✅ | ✅ | ✅ |
Note: you need to install three conda environments: one for EB-ALFRED and EB-Habitat, one for EB-Navigation, and one for EB-Manipulation. Please use SSH download instead of HTTP download to avoid errors during `git lfs pull`.
Download the repo:

git clone git@github.com:EmbodiedBench/EmbodiedBench.git
cd EmbodiedBench

You have two options for installation: you can either use `bash install.sh` or manually run the provided commands. After completing the installation with `bash install.sh`, you will need to start the headless server and verify that each environment is properly set up.
1️⃣ Environment for Habitat and Alfred
conda env create -f conda_envs/environment.yaml
conda activate embench
pip install -e .

2️⃣ Environment for EB-Navigation
conda env create -f conda_envs/environment_eb-nav.yaml
conda activate embench_nav
pip install -e .

3️⃣ Environment for EB-Manipulation
conda env create -f conda_envs/environment_eb-man.yaml
conda activate embench_man
pip install -e .

Note: EB-ALFRED, EB-Habitat, and EB-Manipulation require downloading large datasets from Hugging Face or GitHub repositories. Ensure Git LFS is properly initialized by running the following commands:
git lfs install
git lfs pull

Please run the startx.py script before running experiments on headless servers. The server should be started in a separate tmux window. We use X_DISPLAY id=1 by default.
python -m embodiedbench.envs.eb_alfred.scripts.startx 1

Download the dataset from Hugging Face:
conda activate embench
git clone https://huggingface.co/datasets/EmbodiedBench/EB-ALFRED
mv EB-ALFRED embodiedbench/envs/eb_alfred/data/json_2.1.0

Run the following code to ensure the EB-ALFRED environment is working correctly. Remember to start the headless server first.
conda activate embench
python -m embodiedbench.envs.eb_alfred.EBAlfEnv

- Install Habitat-Sim and Habitat-Lab via:
conda activate embench
conda install -y habitat-sim==0.3.0 withbullet headless -c conda-forge -c aihabitat
git clone -b 'v0.3.0' --depth 1 https://github.com/facebookresearch/habitat-lab.git ./habitat-lab
cd ./habitat-lab
pip install -e habitat-lab
cd ..

- Download the YCB and ReplicaCAD datasets for the Language Rearrangement task:
conda install -y -c conda-forge git-lfs
python -m habitat_sim.utils.datasets_download --uids rearrange_task_assets
mv data embodiedbench/envs/eb_habitat

After the above step, there should be a data folder under envs/eb_habitat.
Run the following code to ensure the EB-Habitat environment is working correctly.
conda activate embench
python -m embodiedbench.envs.eb_habitat.EBHabEnv

Run the following code to ensure the EB-Navigation environment is working correctly.
conda activate embench_nav
python -m embodiedbench.envs.eb_navigation.EBNavEnv

- Install the Coppelia Simulator:
CoppeliaSim V4.1.0 is required for Ubuntu 20.04; you can find other versions here: https://www.coppeliarobotics.com/previousVersions#
conda activate embench_man
cd embodiedbench/envs/eb_manipulation
wget https://downloads.coppeliarobotics.com/V4_1_0/CoppeliaSim_Pro_V4_1_0_Ubuntu20_04.tar.xz
tar -xf CoppeliaSim_Pro_V4_1_0_Ubuntu20_04.tar.xz
rm CoppeliaSim_Pro_V4_1_0_Ubuntu20_04.tar.xz
mv CoppeliaSim_Pro_V4_1_0_Ubuntu20_04/ /PATH/YOU/WANT/TO/PLACE/COPPELIASIM

- Add the following to your ~/.bashrc file:
export COPPELIASIM_ROOT=/PATH/YOU/WANT/TO/PLACE/COPPELIASIM
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT

Remember to source your bashrc (source ~/.bashrc) or zshrc (source ~/.zshrc) after this.
- Install the PyRep package, the EB-Manipulation package, and the dataset:
git clone https://github.com/stepjam/PyRep.git
cd PyRep
pip install -r requirements.txt
pip install -e .
cd ..
pip install -r requirements.txt
pip install -e .
cp ./simAddOnScript_PyRep.lua $COPPELIASIM_ROOT
git clone https://huggingface.co/datasets/EmbodiedBench/EB-Manipulation
mv EB-Manipulation/data/ ./
rm -rf EB-Manipulation/
cd ../../..

Remember that whenever you reinstall PyRep, simAddOnScript_PyRep.lua will be overwritten, so you need to copy it to $COPPELIASIM_ROOT again.
- Run the following code to ensure EB-Manipulation is working correctly (start the headless server if you have not already):
conda activate embench_man
export DISPLAY=:1
python -m embodiedbench.envs.eb_manipulation.EBManEnv

Before running evaluations, set up your environment variables if you plan to use proprietary models:
export OPENAI_API_KEY="your_oai_api_key_here"
export GEMINI_API_KEY="your_gemini_api_key_here"
export ANTHROPIC_API_KEY="your_anpic_api_key_here"
export DASHSCOPE_API_KEY="your_dashscope_api_key_here" # for the official Qwen APIs

To evaluate MLLMs in EmbodiedBench, activate the corresponding conda environment and run:
conda activate embench
python -m embodiedbench.main env=eb-alf model_name=gpt-4o-mini exp_name='baseline'
python -m embodiedbench.main env=eb-hab model_name=gpt-4o-mini exp_name='baseline'
conda activate embench_nav
python -m embodiedbench.main env=eb-nav model_name=gpt-4o exp_name='baseline'
conda activate embench_man
python -m embodiedbench.main env=eb-man model_name=claude-3-5-sonnet-20241022 exp_name='baseline'

You can customize the evaluation using the following flags:
- `env`: The environment to test. Choose from `eb-alf` (EB-ALFRED), `eb-hab` (EB-Habitat), `eb-man` (EB-Manipulation), and `eb-nav` (EB-Navigation).
- `model_name`: Full model name, including proprietary options like `gpt-4o`, `gpt-4o-mini`, `claude-3-5-sonnet-20241022`, `gemini-1.5-pro`, `gemini-2.0-flash-exp`, and `gemini-1.5-flash`.
- `model_type`: Set to `remote` by default.
- `down_sample_ratio`: Data sampling ratio (default `1.0`). Use `0.1` for debugging (10% of the dataset).
- `language_only`: If `True` (or `1`), the agent receives only text input (default: `False`).
- `eval_sets`: List of subsets to evaluate (default: all subsets).
- `chat_history`: Enables multi-turn interaction (`False` by default, as it may reduce performance).
- `n_shots`: Maximum number of textual examples for in-context learning (varies by environment).
- `multiview`: Uses multi-view images as input (only for EB-Manipulation & EB-Navigation, default: `False`).
- `multistep`: Includes historical multi-step images (`False` by default).
- `detection_box`: Enables detection box input (valid for EB-ALFRED, EB-Navigation, and EB-Manipulation).
- `resolution`: Image resolution (default: `500`).
- `exp_name`: Name of the experiment, used in logging.
- `visual_icl`: Enables visual in-context learning (`False` by default).
- `log_level`: Sets the logging level (`INFO` by default). Use `DEBUG` for debugging purposes.
- `truncate`: [Currently only relevant for EB-Navigation, since other tasks normally don't require `chat_history=True`.] Enables truncation of the conversation history when `chat_history=True` (`False` by default). When enabled, it automatically removes verbose content from previous conversation turns while preserving key information. Only takes effect when `chat_history=True`.

⚠️ Important: Avoid enabling multiple flags simultaneously from `visual_icl`, `multiview`, `multistep`, and `chat_history` to prevent excessive image inputs and conflicts.
🔧 Context Management with Truncate:
For long navigation tasks with `chat_history=True`, the conversation history can become quite lengthy, potentially hurting model performance and exceeding context limits. The `truncate` feature addresses this by preprocessing the message history and truncating repetitive prompts before sending it to the model.
The lengthy, repetitive prompts arise because we support a `WINDOW_SIZE` argument (set in the corresponding planner.py file) when `chat_history=True`: the window keeps only the last `WINDOW_SIZE` messages before they are sent to the model. So that a system prompt is still present when the history exceeds `WINDOW_SIZE`, every message carries its own copy of the system prompt, and truncation then removes that copy from all messages except the last one.
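To illustrate the idea, the following is a conceptual sketch of the windowing-plus-truncation logic described above, not the actual implementation in the corresponding planner.py; the message structure and field names are assumptions made for the example.

```python
# Conceptual sketch of chat-history windowing with truncation (illustrative;
# the real logic lives in the corresponding planner.py). Each turn is assumed
# to embed its own copy of the system prompt so windowing never drops it.
WINDOW_SIZE = 5  # illustrative value; configured in the corresponding planner.py

def build_model_input(history, truncate=True):
    """Keep the last WINDOW_SIZE turns; if truncate is enabled, strip the
    repeated system prompt from every turn except the most recent one."""
    window = history[-WINDOW_SIZE:]
    if not truncate:
        return window
    compact = []
    for i, turn in enumerate(window):
        if i < len(window) - 1:
            # Older turns keep only their essential content (assumed field name).
            turn = {k: v for k, v in turn.items() if k != "system_prompt"}
        compact.append(turn)
    return compact
```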
Usage Example:
# Enable chat history with truncation for better context management
conda activate embench_nav
python -m embodiedbench.main env=eb-nav model_name=gpt-4o chat_history=True truncate=True exp_name='nav_with_truncation'
# Compare with standard chat history (without truncation)
python -m embodiedbench.main env=eb-nav model_name=gpt-4o chat_history=True truncate=False exp_name='nav_standard_history'
# Standard evaluation without chat history (truncate has no effect)
python -m embodiedbench.main env=eb-nav model_name=gpt-4o chat_history=False exp_name='nav_no_history'

When to use `truncate=True`:
- Long navigation episodes (>10 steps) with `chat_history=True`
- Models with limited context windows
- When experiencing performance degradation due to overly long conversation history
- To reduce API costs for proprietary models by managing token usage
We support two deployment methods for open-source models: offline running and model serving.
For local execution, set model_type=local and adjust tp (tensor parallelism) based on GPU memory.
- A rough guideline for 48GB GPUs: use `tp = ceil(model size in B / 10)`; for example, a 38B model needs `tp = ceil(38 / 10) = 4`.
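As a quick sanity check of this rule of thumb (the helper below is ours, not part of EmbodiedBench), the resulting values match the `tp` settings used in the example commands that follow:

```python
# Rule-of-thumb tensor parallelism for 48GB GPUs: tp = ceil(model_size_in_B / 10).
from math import ceil

def suggested_tp(model_size_b: float) -> int:
    return ceil(model_size_b / 10)

print(suggested_tp(7))   # 1 -> Qwen2-VL-7B-Instruct
print(suggested_tp(11))  # 2 -> Llama-3.2-11B-Vision-Instruct
print(suggested_tp(38))  # 4 -> InternVL2_5-38B
```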
conda activate embench
python -m embodiedbench.main env=eb-alf model_name=Qwen/Qwen2-VL-7B-Instruct model_type=local exp_name='baseline' tp=1
python -m embodiedbench.main env=eb-hab model_name=OpenGVLab/InternVL2_5-8B model_type=local exp_name='baseline' tp=1
conda activate embench_nav
python -m embodiedbench.main env=eb-nav model_name=OpenGVLab/InternVL2_5-38B model_type=local exp_name='baseline' tp=4
conda activate embench_man
python -m embodiedbench.main env=eb-man model_name=meta-llama/Llama-3.2-11B-Vision-Instruct model_type=local exp_name='baseline' tp=2

Model serving decouples model execution from evaluation, allowing flexible deployment via API calls.
## Step 0, create an environment for lmdeploy
conda env create -f conda_envs/lmdeploy.yaml
conda activate lmdeploy
pip install lmdeploy
## Step 1, open another tmux window and run the model server
lmdeploy serve api_server "OpenGVLab/InternVL2_5-8B" --server-port $port --tp 1
## Step 2, run the evaluation
conda activate embench
export remote_url='IP_address:port/v1' # set the address for access, e.g., http://localhost:8000/v1
python -m embodiedbench.main env=eb-hab model_name=OpenGVLab/InternVL2_5-8B exp_name='baseline'

You can also refer to LMDeploy for more details.
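If you want to sanity-check the served endpoint before launching a full evaluation, a minimal probe could look like the sketch below. It assumes the LMDeploy server started in Step 1 exposes an OpenAI-compatible API at the address you exported; the port and prompt are placeholders, not EmbodiedBench defaults.

```python
# Minimal probe of an OpenAI-compatible LMDeploy endpoint (requires `pip install openai`).
# The base_url and prompt are placeholders; point base_url at your own server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="OpenGVLab/InternVL2_5-8B",
    messages=[{"role": "user", "content": "Describe what an embodied agent does."}],
)
print(response.choices[0].message.content)
```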
LMDeploy often lags behind the release of new models. To address this, we offer a more flexible and dynamic model serving approach. Follow these steps to deploy and evaluate new models:
## 1. Modify the code and hyperparameters in `server.py` according to your requirements.
## We now support "microsoft/Phi-4-multimodal-instruct", 'AIDC-AI/Ovis2-16B', 'AIDC-AI/Ovis2-34B', 'google/gemma-3-12b-it'
## 2. Start the server and install any necessary packages:
pip install flask
CUDA_VISIBLE_DEVICES=${gpu_ids} python server.py
## 3. Run the evaluation in custom mode:
export server_url="IP_address:port/process"
python -m embodiedbench.main env=eb-hab model_name='microsoft/Phi-4-multimodal-instruct' model_type='custom' exp_name='new_model'

We have provided a Docker file under the Docker folder.
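For reference, here is a rough sketch of the Flask-based serving pattern used in step 2 above. The actual server.py in this repo defines its own request/response schema and model loading, so the payload fields and port below are illustrative assumptions only.

```python
# Rough sketch of a Flask custom model server (illustrative; the repo's
# server.py defines the real request/response schema and model loading).
from flask import Flask, request, jsonify

app = Flask(__name__)
model = None  # placeholder: load your MLLM once at startup

@app.route("/process", methods=["POST"])
def process():
    payload = request.get_json()
    prompt = payload.get("prompt", "")       # assumed field name
    images = payload.get("images", [])       # assumed field name
    # Run inference with your model here; this stub just echoes what it got.
    output = f"(stub) got prompt of {len(prompt)} chars and {len(images)} images"
    return jsonify({"response": output})     # assumed response schema

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)       # port is a placeholder
```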
This repo is built on awesome embodied benchmarks and simulators: Lota-Bench, ALFRED, ai2thor, EmbodiedAgentInterface, ML-Llarp, Habitat, VLMBench, and RLBench. Our open-source model deployment is based on LMDeploy and vLLM.
@inproceedings{
yang2025embodiedbench,
title={EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents},
author={Rui Yang and Hanyang Chen and Junyu Zhang and Mark Zhao and Cheng Qian and Kangrui Wang and Qineng Wang and Teja Venkat Koripella and Marziyeh Movahedi and Manling Li and Heng Ji and Huan Zhang and Tong Zhang},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=DgGF2LEBPS}
}

