Julian Quevedo1, Ansh Kumar Sharma2, Yixiang Sun2, Varad Suryavanshi2, Percy Liang1, Sherry Yang1,2,3
Stanford University1 New York University2 Google DeepMind3
This repository contains the evaluation harness used in Evaluating Robot Policies in a World Model. It bundles
- the pretrained diffusion world model,
- policy-specific runners for OpenVLA, Octo, SpatialVLA, and RT-1-X, and
- utilities for dataset conversion and automatic VLM scoring.
Install the package in editable mode (optionally with extras for specific policies):

```
pip install -e .[openvla,spatialvla,octo,rt1]
```

Extras are additive; omit the ones you do not need. Some stacks have additional one-off steps:
- Octo –
  - install the dlimp library:

    ```
    pip install git+https://github.com/kvablack/dlimp@5edaa4691567873d495633f2708982b42edf1972 --no-deps
    ```

  - edit the installed Octo package (typically under your Python site-packages) and update `octo/utils/typing.py` so that it defines `PRNGKey = jax.random.PRNGKey`.
- RT-1-X – obtain the official JAX checkpoint from the Open X-Embodiment release.
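The Octo `typing.py` edit above can also be scripted instead of done by hand. A minimal sketch, assuming a generic append-if-missing helper — `ensure_line` is a name introduced here for illustration, not part of the repo:

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to the file at `path` if it is not already present.

    Returns True if the file was modified."""
    text = path.read_text()
    if line in text:
        return False
    path.write_text(text.rstrip("\n") + "\n" + line + "\n")
    return True


# Usage (inside the environment where Octo is installed):
#   import octo
#   ensure_line(Path(octo.__file__).parent / "utils" / "typing.py",
#               "PRNGKey = jax.random.PRNGKey")
```

The helper is idempotent, so re-running it after an Octo reinstall is safe.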
The evaluation runners require a diffusion world-model checkpoint, e.g. `mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt` (≈9 GB). Download it from Google Drive with gdown:

```
pip install gdown
gdown 1uiRP2BuavapMsyP9Cbr25mi_ymk9SEJb
```

Point every runner's `--root-dir` flag at the directory whose subfolders contain `*.png` + metadata pairs. The helper `discover_trials` recursively discovers tasks from that root.
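As an illustration of the layout `discover_trials` expects, here is a minimal stand-in that walks a root directory and collects every subfolder holding `*.png` frames alongside a metadata file. This is a sketch only: the real helper in the repo is authoritative, and the `metadata.json` filename used below is an assumption.

```python
from pathlib import Path


def find_trial_dirs(root: str) -> list[Path]:
    """Recursively collect subfolders containing *.png frames plus a metadata file.

    Illustration of the expected --root-dir layout; the repo's `discover_trials`
    is the real implementation and may use a different metadata filename."""
    trials = []
    for d in sorted(Path(root).rglob("*")):
        if d.is_dir() and any(d.glob("*.png")) and (d / "metadata.json").exists():
            trials.append(d)
    return trials
```

A root like `tasks/pick_cup/trial0/{frame0.png, metadata.json}` would yield one trial per leaf folder.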
```
world-model-eval-openvla \
  --root-dir /path/to/tasks \
  --checkpoint-path ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt \
  --model-name openvla-7b \
  --save-video --video-out-dir ./rollouts/openvla
```

```
world-model-eval-spatialvla \
  --root-dir /path/to/tasks \
  --checkpoint-path ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt \
  --model-name spatialvla-4b-224-pt
```

```
world-model-eval-octo \
  --root-dir /path/to/tasks \
  --checkpoint-path ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt \
  --model-name octo-base-1.5
```

The RT-1 runner uses Abseil flags:

```
world-model-eval-rt1 \
  --root_dir /path/to/tasks \
  --checkpoint_path /path/to/rt1x_checkpoint \
  --world_model_checkpoint ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt
```

Pass `--save_video` / `--video_out_dir` counterparts where available if you want MP4 rollouts.
To launch training on the tiny 10-example dataset in `sample_data/`:

```
# Replace N with the number of available GPUs
torchrun --nproc_per_node=N train.py
```

Checkpoints and generated GIF samples will be written to `outputs/<timestamp>/`.
To train on the Open X-Embodiment datasets we used in the paper:
```
# We'll need tensorflow and tensorflow_datasets since this code is
# based on the original Open X-Embodiment repo.
pip install tensorflow tensorflow_datasets

# For example, download just the RT-1 dataset:
python -m world_model_eval.download_data --dataset_name rt_1

# By default the data will be written to ./converted_datasets.
# To choose your own output directory:
python -m world_model_eval.download_data --dataset_name rt_1 --output_dir <your output dir>
```

See `world_model_eval/download_data.py` for more dataset names to choose from.
Then launch training with the correct dataset path:
```
# Replace ./converted_datasets if your path is different.
torchrun --nproc_per_node=N -m world_model_eval.train --dataset_dir ./converted_datasets --subset_names rt_1
```

You can pass a comma-separated list to `--subset_names` to train on a mixture of datasets: for example, after downloading the rt_1 and bridge_v2 datasets, `--subset_names rt_1,bridge_v2` trains on both RT-1 and Bridge V2.
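A minimal sketch of how a comma-separated `--subset_names` value resolves to dataset directories under `--dataset_dir`. The function name and the assumption that each subset is a subdirectory of the dataset root are introduced here for illustration; the train script's actual parsing may differ.

```python
from pathlib import Path


def resolve_subsets(dataset_dir: str, subset_names: str) -> list[Path]:
    """Split a comma-separated --subset_names value and verify each subset
    exists under --dataset_dir. Sketch only, not the repo's implementation."""
    paths = []
    for name in subset_names.split(","):
        p = Path(dataset_dir) / name.strip()
        if not p.is_dir():
            raise FileNotFoundError(f"missing dataset subset: {p}")
        paths.append(p)
    return paths
```

Failing fast on a missing subset avoids discovering a typo like `rt_1,brige_v2` only after a long data-loading warmup.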
Since Bridge V2 was not included in the original Open X-Embodiment dataset, you'll need to first download the TFDS dataset to your machine like this:
```
wget -r -np -R "index.html*" https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/
```
Then convert the dataset to our format with `python -m world_model_eval.download_data --dataset_name bridge_v2`, changing `BRIDGE_V2_PATH` at the top of the script if necessary. Since Bridge V2 is a superset of Bridge V1, download either `bridge` or `bridge_v2`, not both.
If you find this work useful, please cite:
```
@misc{quevedo2025worldgymworldmodelenvironment,
  title={WorldGym: World Model as An Environment for Policy Evaluation},
  author={Julian Quevedo and Ansh Kumar Sharma and Yixiang Sun and Varad Suryavanshi and Percy Liang and Sherry Yang},
  year={2025},
  eprint={2506.00613},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2506.00613},
}
```



