
The COLOSSEUM:
A Benchmark for Evaluating Generalization for Robotic Manipulation

RSS 2024

Wilbert Pumacay*¹, Ishika Singh*², Jiafei Duan*³, Ranjay Krishna³,⁴, Jesse Thomason², Dieter Fox³,⁵

¹Universidad Católica San Pablo, ²University of Southern California, ³University of Washington, ⁴Allen Institute for Artificial Intelligence, ⁵NVIDIA

* Equal contribution


Abstract

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup.

We present Colosseum, a novel simulation benchmark with 20 diverse manipulation tasks that enables systematic evaluation of models across 14 axes of environmental perturbations. These perturbations include changes in the color, texture, and size of objects, table-tops, and backgrounds, as well as changes in object physical properties; we also vary lighting, distractors, and camera pose. Using Colosseum, we compare 5 state-of-the-art manipulation models and find that their success rates degrade by 30-50% across these perturbation factors.
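As a rough illustration of what a degradation figure like this could mean, the sketch below computes the percentage drop in success rate relative to an unperturbed baseline. This is an assumption: the text here does not define how degradation is measured, and the function name and the numbers are hypothetical placeholders, not results from the paper.

```python
def relative_degradation(baseline_success: float, perturbed_success: float) -> float:
    """Percentage drop in success rate relative to the unperturbed baseline.

    Assumption: degradation is measured relative to the baseline success
    rate; the source text does not spell out the exact definition.
    """
    if baseline_success <= 0:
        raise ValueError("baseline success rate must be positive")
    return 100.0 * (baseline_success - perturbed_success) / baseline_success


if __name__ == "__main__":
    # Placeholder success rates for illustration only, not paper results.
    baseline, perturbed = 0.5, 0.25
    print(f"degradation: {relative_degradation(baseline, perturbed):.1f}%")
```

Under this definition, a policy that never succeeds under a perturbation shows 100% degradation regardless of how strong its baseline was.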

When multiple perturbations are applied in unison, the success rate degrades by more than 75%. We find that changing the number of distractor objects, the target object color, or the lighting conditions reduces model performance the most. To verify the ecological validity of our results, we show that our results in simulation are correlated (R² = 0.614) with similar perturbations in real-world experiments. We open-source code for others to use Colosseum, and also release code to 3D print the objects used to replicate the real-world perturbations. Ultimately, we hope that Colosseum will serve as a benchmark to identify modeling decisions that systematically improve generalization for manipulation.
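For readers who want to run a similar sim-to-real check on their own measurements, here is a minimal sketch of an R² computation. It assumes R² is the coefficient of determination of a simple least-squares line fit (equivalently, the squared Pearson correlation) between paired simulated and real-world scores; the text does not state exactly how the reported value was obtained, and the function name and sample values are hypothetical.

```python
from math import fsum
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R² of a simple least-squares line fit of y on x.

    For a one-variable linear fit this equals the squared Pearson
    correlation between x and y.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(x)
    mean_x = fsum(x) / n
    mean_y = fsum(y) / n
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("R² is undefined when either sequence is constant")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Placeholder paired scores (simulation vs. real world), illustration only.
    sim = [1.0, 2.0, 3.0, 4.0]
    real = [1.0, 3.0, 2.0, 4.0]
    print(f"R^2 = {r_squared(sim, real):.3f}")
```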

Leaderboard on THE COLOSSEUM

Perturbations

Perturbation Factors

Perturbation applied to task

Failure cases

Failure cases for PerAct

Failure cases for RVT

Failure cases for R3M

Failure cases for MVP

Failure cases for VoxPoser

Reproducibility in real-world experiments

All the object assets are 3D printed and will be released.

Examples of real-world perturbation results with PerAct

Slide_block_to_target

Insert_on_square_peg

Setup_chess

Scoop_with_spatula

BibTeX

@article{pumacay2024colosseum,
  title     = {THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation}, 
  author    = {Pumacay, Wilbert and Singh, Ishika and Duan, Jiafei and Krishna, Ranjay and Thomason, Jesse and Fox, Dieter},
  journal   = {arXiv preprint arXiv:2402.08191},
  year      = {2024},
}
