Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning
Abstract
Imitation Learning (IL) can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle when observations fall outside the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address these challenges, we propose Adapt3R, a general-purpose 3D observation encoder that synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that, when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
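To make the localization idea concrete, here is a minimal NumPy sketch of how pixels from a calibrated RGBD camera could be lifted to 3D and expressed relative to the end-effector. It is an illustration under stated assumptions, not the Adapt3R implementation: the function name `backproject_to_ee_frame`, the pinhole camera model, and the 4x4 homogeneous transform convention are all hypothetical choices made for this example.

```python
import numpy as np

def backproject_to_ee_frame(depth, K, T_world_cam, T_world_ee):
    """Lift a depth image to 3D points expressed in the end-effector frame.

    Assumptions (not taken from the paper): a pinhole camera model and
    4x4 homogeneous transforms that map each frame into the world frame.

    depth:       (H, W) metric depth along the camera z-axis
    K:           (3, 3) camera intrinsics
    T_world_cam: (4, 4) camera-to-world transform
    T_world_ee:  (4, 4) end-effector-to-world transform
    """
    H, W = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Invert the pinhole projection to get camera-frame coordinates.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    pts_cam = pts_cam.reshape(-1, 4)

    # Compose camera -> world -> end-effector into one transform.
    T_ee_cam = np.linalg.inv(T_world_ee) @ T_world_cam
    pts_ee = (T_ee_cam @ pts_cam.T).T[:, :3]
    return pts_ee.reshape(H, W, 3)
```

Because the resulting coordinates are relative to the end-effector rather than the camera, a point in the scene keeps the same coordinates when the calibrated camera is moved, which is one intuition for why such a representation could help with unseen viewpoints.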
Real Robot Videos
We train a language-conditioned multitask Adapt3R policy to complete 6 tasks on a real UR5 robot.
Unseen Viewpoint
Adapt3R enables zero-shot transfer to a novel camera viewpoint that observes the scene from a completely different angle.
Attention Maps
We visualize attention maps from our attention pooling operation. In the training distribution, we see that Adapt3R mainly attends to the task-relevant objects.
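For readers unfamiliar with attention pooling, the sketch below shows one common form of it: a single query scores every per-point feature, and the softmax of those scores weights the sum. This page does not specify the exact operation used in Adapt3R, so the single-head design, the query vector, and the projection matrices `W_k` and `W_v` are assumptions made for illustration. The returned weights are the kind of quantity an attention map would visualize.

```python
import numpy as np

def attention_pool(features, query, W_k, W_v):
    """Pool N feature vectors into one vector with single-head attention.

    Hypothetical shapes for this sketch:
      features: (N, D)      per-point features
      query:    (D_k,)      query vector (learned in a real model)
      W_k:      (D, D_k)    key projection
      W_v:      (D, D_out)  value projection
    Returns the pooled vector (D_out,) and the attention weights (N,).
    """
    keys = features @ W_k
    values = features @ W_v

    # Scaled dot-product scores, one per input point.
    scores = keys @ query / np.sqrt(keys.shape[-1])

    # Numerically stable softmax over the N points.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()

    # Weighted sum of the values gives the pooled representation.
    return weights @ values, weights
```

The weights are non-negative and sum to one, so points with high scores dominate the pooled vector while low-scoring points contribute little.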
Attention Maps with Unseen Camera View
Now, after we move the camera to an unseen viewpoint, we see that Adapt3R attends to the same task-relevant objects, helping to minimize the domain gap.
LIBERO-90 Videos
We train a multitask Adapt3R policy to complete 90 tasks from the LIBERO-90 benchmark.
Novel Embodiment
After training only with the Franka Panda robot, Adapt3R enables zero-shot transfer to novel embodiments.
Kinova3
UR5e
Kuka IIWA
Novel Camera Pose
Adapt3R enables zero-shot transfer to new camera poses. Here we show small, medium, and large changes in camera pose.
Small Camera Change
Medium Camera Change
Large Camera Change
BibTeX
@misc{wilcox2025adapt3r,
  title={Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning},
  author={Albert Wilcox and Mohamed Ghanem and Masoud Moghani and Pierre Barroso and Benjamin Joffe and Animesh Garg},
  year={2025},
  eprint={2503.04877},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.04877}
}