Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning

Albert Wilcox¹,², Mohamed Ghanem¹, Masoud Moghani³, Pierre Barroso², Benjamin Joffe¹,², Animesh Garg¹

¹Georgia Institute of Technology, ²Georgia Tech Research Institute, ³University of Toronto


TL;DR Adapt3R is a 3D perception encoder that, combined with any of a variety of IL algorithms, enables zero-shot transfer to unseen embodiments and camera viewpoints.

Abstract

Imitation learning (IL) can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle when observations fall outside the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address these challenges, we propose Adapt3R, a general-purpose 3D observation encoder that synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that, when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
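To make the idea of localizing information with respect to the end-effector concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, and the function names and the 4x4 pose matrices `T_world_cam` and `T_world_ee` are hypothetical names chosen for illustration.

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift an (H, W) depth image to an (H*W, 3) point cloud in the camera frame.

    Assumes a pinhole model with intrinsics K (3x3); this is an illustrative
    assumption, not a detail taken from the Adapt3R page.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def to_end_effector_frame(points_cam, T_world_cam, T_world_ee):
    """Express camera-frame points in the end-effector frame.

    T_world_cam and T_world_ee are hypothetical 4x4 homogeneous poses of the
    camera and the end-effector in a shared world frame.
    """
    T_ee_cam = np.linalg.inv(T_world_ee) @ T_world_cam
    ones = np.ones((len(points_cam), 1))
    homog = np.concatenate([points_cam, ones], axis=1)  # (N, 4)
    return (homog @ T_ee_cam.T)[:, :3]
```

Because the camera extrinsics enter only through `T_world_cam`, moving a calibrated camera changes which points are observed but not the coordinates assigned to a given scene point, which is one plausible reading of why an end-effector-relative representation could help with unseen viewpoints.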

Adapt3R extracts scene representations from RGBD inputs for imitation learning, and is designed to work well with a variety of state-of-the-art imitation learning algorithms. It starts by lifting pretrained foundation model features computed on the RGBD inputs into a 3D scene representation. Then, after a carefully designed point cloud processing step, it uses attention pooling to compress the point cloud into a single vector z, which can be used as conditioning for a policy in an end-to-end learning setup.
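The attention pooling step can be sketched as follows. This is a generic single-query attention pool written in NumPy under stated assumptions, not the Adapt3R code: the function name, the projection matrices `W_k` and `W_v`, and the `query` vector are all hypothetical, and in a trained model they would be learned parameters.

```python
import numpy as np

def attention_pool(features, query, W_k, W_v):
    """Compress (N, D) per-point features into a single vector z.

    features: (N, D) per-point features (e.g. lifted 2D backbone features).
    query:    (d,) pooling query (hypothetical; learned in practice).
    W_k:      (D, d) key projection; W_v: (D, D_out) value projection.
    Returns z of shape (D_out,) and the (N,) attention weights.
    """
    keys = features @ W_k                            # (N, d)
    values = features @ W_v                          # (N, D_out)
    scores = keys @ query / np.sqrt(keys.shape[1])   # scaled dot-product, (N,)
    scores = scores - scores.max()                   # subtract max for stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()                # softmax over points
    z = weights @ values                             # weighted sum of values
    return z, weights
```

Since the output is a weighted sum over points, z does not depend on the order of the points or directly on which camera produced them, and the returned weights are the kind of quantity that can be visualized as an attention map.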

Real Robot Videos

We train a language-conditioned multitask Adapt3R policy to complete 6 tasks on a real UR5 robot.

Unseen Viewpoint

Adapt3R enables zero-shot transfer to a novel camera viewpoint that observes the scene from a completely different angle.

Attention Maps

We visualize attention maps from our attention pooling operation. In the training distribution, we see that Adapt3R mainly attends to the task-relevant objects.

Attention Maps with Unseen Camera View

After we move the camera to an unseen viewpoint, Adapt3R still attends to the same task-relevant objects, which helps minimize the domain gap.

LIBERO-90 Videos

We train a multitask Adapt3R policy to complete 90 tasks from the LIBERO-90 benchmark.

Novel Embodiment

After training only with the Franka Panda robot, Adapt3R enables zero-shot transfer to novel embodiments.

Kinova3

UR5e

Kuka IIWA

Novel Camera Pose

Adapt3R enables zero-shot transfer to new camera poses. Here we show small, medium, and large changes in camera pose.

Small Camera Change

Medium Camera Change

Large Camera Change

BibTeX

@misc{wilcox2025adapt3r,
    title={Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning}, 
    author={Albert Wilcox and Mohamed Ghanem and Masoud Moghani and Pierre Barroso and Benjamin Joffe and Animesh Garg},
    year={2025},
    eprint={2503.04877},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2503.04877}}
