Abstract
The pre-training of visual representations has enhanced the efficiency of robot learning.
Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representations.
Despite their promising results, representations from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion.
We first evaluate various pre-trained representations in terms of their correlation to the downstream robotic manipulation tasks (i.e., manipulation centricity).
Interestingly, we find that the "manipulation centricity" is a strong indicator of success rates when applied to downstream tasks.
Drawing from these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioceptive states, to improve manipulation centricity.
Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions.
We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with an action prediction loss and a time contrastive loss during pre-training.
Empirical results across four simulation domains with 20 robotic manipulation tasks demonstrate that MCR outperforms the strongest baseline by 14.8%. Additionally, MCR boosts the success rate in three real-world manipulation tasks by 76.9%.
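To make the pre-training objective concrete, below is a minimal sketch of how the three losses described above (dynamics alignment, action prediction, and time contrast) could be combined. It is not the official MCR implementation; the encoders f_img and f_dyn, the action_head, and the loss weights are illustrative placeholders.

# Hypothetical sketch of an MCR-style pre-training objective (not the official code).
# f_img, f_dyn, and action_head are assumed placeholder modules.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """Standard InfoNCE loss over a batch; positives are paired by index."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def mcr_loss(f_img, f_dyn, action_head, obs_t, obs_next, state_t, action_t,
             w_dyn=1.0, w_bc=1.0, w_time=1.0):
    z_t = f_img(obs_t)                                     # visual embedding at time t
    z_next = f_img(obs_next)                               # embedding of a nearby later frame
    d_t = f_dyn(torch.cat([state_t, action_t], dim=-1))    # proprioceptive state-action embedding

    loss_dyn = info_nce(z_t, d_t)                          # align vision with robot dynamics
    loss_bc = F.mse_loss(action_head(z_t), action_t)       # action prediction loss
    loss_time = info_nce(z_t, z_next)                      # time contrastive loss
    return w_dyn * loss_dyn + w_bc * loss_bc + w_time * loss_time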
Manipulation Centricity
Through analyzing feature similarities between Grad-CAM visualizations and SAM2-identified ground truth regions, Manipulation Centricity quantifies a representation's focus on task-relevant areas, predicting downstream performance.
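As an illustration of this idea, the snippet below scores a representation by the similarity between its Grad-CAM saliency map and a binary mask of task-relevant regions (e.g., produced by SAM2). This is a hedged sketch; the exact similarity measure used in the paper may differ.

# Illustrative manipulation-centricity score: cosine similarity between a Grad-CAM
# heatmap and a ground-truth task-relevance mask (assumption, not the paper's exact metric).
import numpy as np

def manipulation_centricity(grad_cam: np.ndarray, gt_mask: np.ndarray) -> float:
    """grad_cam: HxW saliency in [0, 1]; gt_mask: HxW binary mask of task-relevant regions."""
    cam = grad_cam.flatten().astype(np.float64)
    mask = gt_mask.flatten().astype(np.float64)
    denom = np.linalg.norm(cam) * np.linalg.norm(mask)
    return float(cam @ mask / denom) if denom > 0 else 0.0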
Real-world Manipulation Benchmark
MCR consistently outperforms baselines across all real-world tasks.
Grad-CAM visualization on Rearrange. MCR shows the best manipulation centricity.
Simulation Benchmark
Grad-CAM visualization for the Square task from Robomimic and the Pick Place Wall task from MetaWorld.
4 Domains: MetaWorld, DexArt, Robomimic, RoboCasa; 20 Tasks
Findings in robotic datasets
Larger dataset, better performance.
Greater benefits for tasks with a smaller embodiment gap.
Feature Analysis
We perform t-SNE visualization on 10 simulation tasks from MetaWorld and 3 real-robot tasks. Each dot represents an image frame, and each color indicates a task. The results demonstrate that (1) our representation has the best clustering ability and (2) robot data is helpful for learning robotic representations.
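For reference, a minimal sketch of this analysis is shown below: per-frame features are embedded into 2D with t-SNE and colored by task. Feature extraction is assumed to have been done beforehand, and the plotting parameters are illustrative.

# Minimal t-SNE feature-analysis sketch (assumed setup, not the paper's exact script).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, task_ids: np.ndarray, out_path: str = "tsne.png"):
    """features: (N, D) per-frame embeddings; task_ids: (N,) integer task labels."""
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=task_ids, cmap="tab20", s=5)
    plt.axis("off")
    plt.savefig(out_path, dpi=200)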
BibTeX
If you find the project helpful for your research, please consider citing our paper:

@article{jiang2024robots,
title={Robots Pre-Train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets},
author={Jiang, Guangqi and Sun, Yifei and Huang, Tao and Li, Huanyu and Liang, Yongyuan and Xu, Huazhe},
journal={arXiv preprint arXiv:2410.22325},
year={2024}
}