3D Vision Language Models (VLMs) for Robotic Manipulation: Opportunities and Challenges
June 11, 2025, Nashville, TN. Location: 101 A
Introduction
The integration of 3D Vision-Language Models (3D VLMs) into robotics presents a new frontier, blending spatial understanding with contextual reasoning. The Robo-3DVLM workshop explores the opportunities and challenges of combining these technologies to enhance robot perception, decision-making, and interaction with the real world. As robots evolve to operate in increasingly complex environments, bridging the gap between 3D spatial reasoning and language understanding becomes critical. Key questions at the heart of this workshop include:
- High-level vs. Low-level Representations: Is 3D vision crucial, or can 2D representations suffice for robotic tasks? How should robots interpret the world—through point clouds, 3D bounding boxes, or other output formats? What input modalities offer the most efficiency and generalization?
- Pretraining for Policy Learning: Do low-level policies require extensive pretraining, or could projections from vision-language models serve as adequate features? Is it possible that distilled 2D features are sufficient for policy learning, or is a deeper, 3D-centric approach needed?
- 3D Vision-Language Action Models: What are the specific challenges in using 3D VLMs for robotic actions, particularly regarding sensor calibration and real-time performance?
Call for Papers
We are excited to announce the Call for Papers for the Robo-3DVLM workshop. We invite original contributions presenting novel ideas, research, and applications relevant to the workshop’s theme.
Important Dates
| Event | Date |
|---|---|
| Call for Papers | January 30th, 2025 |
| Submission Deadline | May 16th, 2025, 23:59 PST |
| Notification | May 20th, 2025 |
| Camera-Ready | May 25th, 2025 |
Submission Guidelines
- Page Limit: Submissions can be up to 4 pages for the main content. There is no limit on the number of pages for references or appendices.
- Formatting: Submissions are encouraged to use the CVPR template.
- Anonymity: All submissions must be anonymized. Please remove any author names, affiliations, or identifying information.
- Relevant Work: We welcome references to recently published, relevant work (e.g., RSS, CoRL, ICRA, and ICML).
- Archival Status: All accepted papers are non-archival.
- Link: OpenReview submission
Accepted papers will be presented in the form of posters at the workshop. In addition, selected papers may be invited to deliver spotlight talks.
Paper topics
A non-exhaustive list of relevant topics:
- 3D Vision-Language Policy Learning
- Pretraining for 3D Vision-Language Models
- 3D Representations for Policy Learning (e.g., NeRF, Gaussian Splatting, SDF)
- 3D Benchmarks and Simulation Frameworks
- 3D Vision-Language Action Models
- 3D Vision-Language or Large-Language Models for Robotics
- 3D Instruction-tuning datasets for Robotics
- 3D pretraining datasets for Robotics
- Other topics on 3D Vision-Language Models for Robotic Manipulation
Workshop Schedule (Tentative)
| Start Time (CDT) | End Time (CDT) | Event |
|---|---|---|
| 9:00 AM | 9:10 AM | Opening remarks |
| 9:10 AM | 9:45 AM | Hao Su: Exploring World Model for Robotic Manipulation |
| 9:45 AM | 10:20 AM | Chelsea Finn: Pretraining and Posttraining Robotic Foundation Models |
| 10:20 AM | 10:55 AM | Ranjay Krishna: Preparing perception for robotics |
| 10:55 AM | 11:10 AM | Coffee Break |
| 11:10 AM | 11:45 AM | Yunzhu Li: Foundation Models for Structured Scene Modeling in Robotic Manipulation |
| 11:45 AM | 12:20 PM | Katerina Fragkiadaki: 3D Generative Manipulation Policies: Bridging 2D Pre-training with 3D Scene Reasoning |
| 12:20 PM | 1:30 PM | Lunch |
| 1:30 PM | 2:00 PM | Poster Session (ExHall D, #357-#371) |
| 2:00 PM | 2:35 PM | Angel Chang: Building vision-language maps for embodied AI |
| 2:35 PM | 3:10 PM | Dieter Fox: Hierarchical Action Models for Open-World 3D Policies |
| 3:10 PM | 3:25 PM | Coffee Break |
| 3:25 PM | 4:00 PM | Chuang Gan: Genesis: A Unified and Generative Physics Simulation for Robotics |
| 4:00 PM | 4:45 PM | Spotlight Paper Talks (5 min talk / 2 min Q&A): The One RING: A Robotic Indoor Navigation Generalist • Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models • Agentic Language-Grounded Adaptive Robotic Assembly • ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos |
| 4:45 PM | 5:00 PM | Ending Remarks and Paper Awards |
Invited Speakers
(listed alphabetically)
Organizers
For inquiries, contact us at: robo-3dvlm@googlegroups.com