Abstract
The pre-training of visual representations has enhanced the efficiency of robot learning.
Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representations.
Despite their promising results, representations from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion.
We first evaluate various pre-trained representations in terms of their correlation to the downstream robotic manipulation tasks (i.e., manipulation centricity).
Interestingly, we find that the "manipulation centricity" is a strong indicator of success rates when applied to downstream tasks.
Drawing from these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioceptive states, to improve manipulation centricity.
Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions.
We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with an action prediction loss and a time contrastive loss during pre-training.
Empirical results across four simulation domains with 20 robotic manipulation tasks demonstrate that MCR outperforms the strongest baseline by 14.8%. Additionally, MCR boosts the success rate in three real-world manipulation tasks by 76.9%.
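To make the pre-training objective concrete, below is a minimal sketch of how the three losses described above (dynamics alignment, action prediction, and time contrast) could be combined. It is not the official MCR implementation; the encoders f_img and f_dyn, the action_head, and the loss weights are illustrative placeholders.

# Hypothetical sketch of an MCR-style pre-training objective (not the official code).
# f_img, f_dyn, and action_head are assumed placeholder modules.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """Standard InfoNCE loss over a batch; positives are paired by index."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def mcr_loss(f_img, f_dyn, action_head, obs_t, obs_next, state_t, action_t,
             w_dyn=1.0, w_bc=1.0, w_time=1.0):
    z_t = f_img(obs_t)                                     # visual embedding at time t
    z_next = f_img(obs_next)                               # embedding of a nearby later frame
    d_t = f_dyn(torch.cat([state_t, action_t], dim=-1))    # proprioceptive state-action embedding

    loss_dyn = info_nce(z_t, d_t)                          # align vision with robot dynamics
    loss_bc = F.mse_loss(action_head(z_t), action_t)       # action prediction loss
    loss_time = info_nce(z_t, z_next)                      # time contrastive loss
    return w_dyn * loss_dyn + w_bc * loss_bc + w_time * loss_time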
Manipulation Centricity
Through analyzing feature similarities between Grad-CAM visualizations and SAM2-identified ground truth regions, Manipulation Centricity quantifies a representation's focus on task-relevant areas, predicting downstream performance.
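As an illustration of this idea, the snippet below scores a representation by the similarity between its Grad-CAM saliency map and a binary mask of task-relevant regions (e.g., produced by SAM2). This is a hedged sketch; the exact similarity measure used in the paper may differ.

# Illustrative manipulation-centricity score: cosine similarity between a Grad-CAM
# heatmap and a ground-truth task-relevance mask (assumption, not the paper's exact metric).
import numpy as np

def manipulation_centricity(grad_cam: np.ndarray, gt_mask: np.ndarray) -> float:
    """grad_cam: HxW saliency in [0, 1]; gt_mask: HxW binary mask of task-relevant regions."""
    cam = grad_cam.flatten().astype(np.float64)
    mask = gt_mask.flatten().astype(np.float64)
    denom = np.linalg.norm(cam) * np.linalg.norm(mask)
    return float(cam @ mask / denom) if denom > 0 else 0.0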
Real-world Manipulation Benchmark
MCR consistently outperforms baselines across all real-world tasks.
Grad-CAM visualization on Rearrange. MCR shows the best manipulation centricity.
Simulation Benchmark
Grad-CAM visualization for the Square task from Robomimic and the Pick Place Wall task from MetaWorld.
4 Domains: MetaWorld, DexArt, Robomimic, RoboCasa; 20 Tasks
Findings in robotic datasets
Larger dataset, better performance.
Greater benefits for tasks with a smaller embodiment gap.
Feature Analysis
We perform t-SNE visualization on 10 simulation tasks from MetaWorld and 3 real-robot tasks. Each dot represents an image frame, and each color indicates a task. The results demonstrate that (1) our representation has the best clustering ability and (2) robot data is helpful for learning robotic representations.
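For reference, a minimal sketch of this analysis is shown below: per-frame features are embedded into 2D with t-SNE and colored by task. Feature extraction is assumed to have been done beforehand, and the plotting parameters are illustrative.

# Minimal t-SNE feature-analysis sketch (assumed setup, not the paper's exact script).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, task_ids: np.ndarray, out_path: str = "tsne.png"):
    """features: (N, D) per-frame embeddings; task_ids: (N,) integer task labels."""
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=task_ids, cmap="tab20", s=5)
    plt.axis("off")
    plt.savefig(out_path, dpi=200)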
BibTeX
If you find the project helpful for your research, please consider citing our paper:

@article{jiang2024robots,
title={Robots Pre-Train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets},
author={Jiang, Guangqi and Sun, Yifei and Huang, Tao and Li, Huanyu and Liang, Yongyuan and Xu, Huazhe},
journal={arXiv preprint arXiv:2410.22325},
year={2024}
}