Generalizable Priors for Robot Manipulation
Overview
Generalist robot policies are those capable of performing complex manipulation tasks across a wide range of environments. Recent years have seen significant progress toward this goal, driven by advances in large-scale teleoperated robot datasets, structured representations for policy learning, hierarchical planning with vision-language models (VLMs), and online learning for adaptation to novel tasks and environments. Although multiple promising approaches are emerging, it is essential to understand the tradeoffs inherent in each to develop methods that not only generalize to new scenarios but also execute tasks with high precision and reliability.
An ideal framework for learning such generalizable policies should: (1) scale beyond simple pick-and-place tasks and continually improve as more task data becomes available, (2) learn effectively from diverse data sources, including robot teleoperation, simulated environments, and human demonstration videos, and (3) utilize representations of the world that are applicable across tasks requiring varying levels of precision and dexterity. This workshop will focus on a central question: What are the right priors for generalizable policy learning, and how can we best incorporate these priors into policy learning frameworks?
Our speakers and panelists are leading researchers in robotics and machine learning, working at the forefront of areas such as end-to-end control, sim-to-real transfer, learning from human videos, and large-scale robotic data collection. We invite the community to submit their latest work and ideas for discussion.
Areas of Interest
We aim to investigate the following topics and research questions:
- What main bottlenecks prevent current robot policies from generalizing to unseen environments?
  - Is the challenge primarily a data problem, or are there fundamental limitations in current approaches?
- Can pretrained models from related fields, such as vision and language, be leveraged to improve robot policies?
  - What is the most effective way to incorporate these models into robotics?
  - Should we fine-tune them on robot data, or use their representations and outputs for downstream policy learning?
- How can we best utilize simulated environments to promote generalization?
  - Do simulated environments need to closely resemble the real world for successful policy transfer, or can they be substantially different?
  - What role do generative models play in the sim-to-real pipeline? Are they consistent enough to make reliable long-term predictions over full trajectories?
- Is it possible to design universal priors that work across a wide range of tasks and domains, including simulation, the real world, and internet videos?
  - Can object-centric representations provide this capability?
  - Can 3D representations facilitate cross-domain transfer?
- How should we design standardized benchmarks for evaluating such generalist policies?
Schedule
Session 1

| Time | Event |
| --- | --- |
| 9:30 AM - 9:40 AM | Opening Remarks |
| 9:40 AM - 10:05 AM | Talk: Prof. Yang Gao, "Scaling Robot Manipulation with VLMs and Human Videos: Lessons Learned" |
| 10:05 AM - 10:30 AM | Talk: Prof. Harold Soh |
| 10:30 AM - 11:00 AM | Poster Session, Coffee Break |

Session 2

| Time | Event |
| --- | --- |
| 11:00 AM - 11:25 AM | Talk: Dr. Ajay Mandlekar, "Scaling Synthetic Data Generation for Robotics with Point-Based Representations" |
| 11:25 AM - 11:50 AM | Talk: Prof. Georgia Chalvatzaki, "Structured Priors for Efficient Robot Learning" |
| 11:50 AM - 12:15 PM | Talk: Prof. Edward Johns, "The Priors Needed for In-Context Imitation Learning" |
| 12:15 PM - 12:30 PM | Spotlight Talks |
| 12:30 PM - 1:30 PM | Lunch Break |

Session 3

| Time | Event |
| --- | --- |
| 1:30 PM - 2:00 PM | Talk: Prof. Jeannette Bohg, "Three Layers of Priors for Generalizable Robot Manipulation: From Control to Human Data to Physics" |
| 2:00 PM - 2:30 PM | Talk: Prof. Xiaolong Wang, "Going Beyond Teleoperation for Humanoid Manipulation" |
| 2:30 PM - 3:00 PM | Coffee Break, Poster Session |

Session 4

| Time | Event |
| --- | --- |
| 3:00 PM - 3:25 PM | Talk: Jiafei Duan, "Grounding Vision and Language Models for Robotic Manipulation" |
| 3:25 PM - 4:20 PM | Panel Discussion |
| 4:20 PM - 4:30 PM | Closing Remarks and Awards |
Presentation Instructions
Each poster panel will be shared between two papers, arranged side by side with each poster oriented vertically. Please make sure your poster does not exceed 0.92 m (H) × 0.94 m (W), close to A0 portrait but slightly scaled down.
Each oral presentation will be a 5-minute spotlight (4 minutes for the talk and 1 minute for Q&A).
All accepted papers will be presented as posters.
Invited Speakers
Jeannette Bohg
Stanford University, USA
Ajay Mandlekar
NVIDIA, USA
Xiaolong Wang
University of California, San Diego, USA
Jiafei Duan
University of Washington / AI2, USA
Yang Gao
Tsinghua University, China
Georgia Chalvatzaki
TU Darmstadt, Germany
Harold Soh
National University of Singapore, Singapore
Edward Johns
Imperial College London, UK
Accepted Papers
- [Best Paper Award, Spotlight] GLOVER: Generalizable Open-Vocabulary Affordance Reasoning for Task-Oriented Grasping
- [Spotlight] HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation
- [Spotlight] Touch begins where vision ends: Generalizable policies for contact-rich manipulation
- VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
- Learning to Act Through Contact: A Unified View of Multi-Task Robot Learning
- Slot-Based Object-Centric Representations Improve Policy Generalization in Robot Manipulation
- Generalization of Manipulation Skills using Keypoint Priors
- Unifying What and How: Distilling a Pre-trained Unified Skill Representation for Efficient Adaptation
- cVLA: Towards Efficient Camera-Space VLAs
- Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping
- Guaranteed $SE(3)$-Equivariant Control via Hand-Centric Behavior Cloning
- AURA: Autonomous Upskilling with Retrieval-Augmented Agents
- Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
- Demystifying Diffusion Policies: Action Memorization and Simple Lookup Table Alternatives
- RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
- Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions
- Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress
Organizing Committee
Siddhant Haldar
New York University, USA
Mara Levy
University of Maryland, USA
Jiafei Duan
University of Washington, USA
Ivan Kapelyukh
Imperial College London, UK