Vision-Language Models for Navigation and Manipulation (VLMNM)
Full-day hybrid workshop at ICRA 2024, Room 315, Yokohama (Japan)
Friday, May 17, 9 am - 5 pm (JST)
Recordings of invited talks can be found on our YouTube channel.
For further information or questions, please contact vlm-navigation-manipulation-workshop [AT] googlegroups [DOT] com

With the rising capabilities of LLMs and VLMs, the past two years have seen a surge in research using VLMs for navigation and manipulation. By fusing visual interpretation with natural language processing, these models are poised to redefine how robotic systems interact with both their environment and human counterparts. The relevance of this topic cannot be overstated: as the frontier of human-robot interaction expands, so does the need for robots to comprehend and operate within complex environments using naturalistic instructions. Our workshop reflects state-of-the-art advances in this domain by featuring a diverse set of speakers: from senior academics to early-career researchers, from industry researchers to companies producing mobile manipulation platforms, and from researchers who are enthusiastic about using VLMs for robotics to those who have reservations about it. We aim for this event to be a catalyst for originality and diversity at ICRA 2024, and we believe that, amidst a sea of workshops, ours will provide unique perspectives that push the boundaries of what is achievable in robot navigation and manipulation.
In this workshop, we plan to discuss:
- How can VLMs/LLMs enhance robot navigation and manipulation?
- How to extract world knowledge from pre-trained VLMs/LLMs and apply it to navigation and manipulation?
- How to integrate VLMs/LLMs with robot components, such as perception, control, and planning? How to account for partial observability and uncertainty?
- Benchmarks and datasets to assess the generalization capabilities of VLMs/LLMs for navigation and manipulation.
- Capabilities and limitations of VLMs/LLMs for navigation and manipulation (e.g., task planning, spatial understanding).
- New interaction modes between robots and humans enabled by VLMs/LLMs.
Final Schedule
| Time (JST) | Event | Description | Time (PDT) (May 16) |
|---|---|---|---|
| 8:30 - 8:50 | Coffee and Pastries | Poster presenters set up posters | 16:30 - 16:50 |
| 8:50 - 9:00 | Introduction | | 16:50 - 17:00 |
| 9:10 - 9:35 | Invited talk: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks [Recording] | Prof. Subbarao Kambhampati, Arizona State University | 17:10 - 17:35 |
| 9:35 - 10:00 | Invited talk: LLM-based Task and Motion Planning for Robots [Recording] | Prof. Chuchu Fan, Massachusetts Institute of Technology | 17:35 - 18:00 |
| 10:00 - 10:20 | Invited talk: On the Challenges and Opportunities of Policy Learning for Mobile Manipulation [Recording] | Prof. Jeannette Bohg, Stanford University | 18:00 - 18:20 |
| 10:25 - 10:45 | Coffee Break and Poster Session | 20 min | 18:25 - 18:45 |
| 11:00 - 11:20 | Invited talk: LLM-State: Adaptive State Representation for Long-Horizon Task Planning in the Open World [Recording] | Prof. David Hsu, National University of Singapore | 19:00 - 19:20 |
| 11:20 - 11:40 | Invited talk: LLMs for System 1 Generalization [Recording] | Prof. Yuke Zhu, University of Texas at Austin | 19:20 - 19:40 |
| 11:40 - 12:00 | Panel: Bridging the Gap between Research & Industry | Moderator: Naoki Wake, Microsoft Research | 19:40 - 20:00 |
| 12:00 - 12:20 | | | 20:00 - 20:20 |
| 12:40 - 13:30 | Lunch Break | 50 min | 20:40 - 21:30 |
| 13:30 - 13:50 | Invited talk: Language as Bridge for Sim2Real [Recording] | Prof. Roberto Martín-Martín, University of Texas at Austin | 21:30 - 21:50 |
| 13:50 - 14:10 | Invited talk: Foundation Models of and for Navigation [Recording] | Dhruv Shah, University of California, Berkeley | 21:50 - 22:10 |
| 14:15 - 14:35 | Invited talk: BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [Recording] | Dr. Ruohan Zhang, Stanford University | 22:15 - 22:35 |
| 14:35 - 15:05 | Spotlight Talks (six) [Recording] | | 22:35 - 23:05 |
| 15:10 - 16:00 | Coffee Break and a Longer Poster Session | 1 hour | 23:10 - 24:00 |
| 16:00 - 16:40 | Debate: Are large foundation models the most important research topic in the next 5 years? (and various other questions) | Moderator: Nur Muhammad Mahi Shafiullah, New York University. *The organizers may or may not be serious about this special guest. | 00:00 - 00:40 |
| 16:40 - 16:55 | Moderated Open Discussion: What's Down the Horizon? / The 1 Billion Dollar Proposal | All in-person attendees | 00:40 - 00:55 |
| 16:55 - 17:00 | Best Paper Awards Ceremony and Closing Remarks | | 00:55 - 01:00 |
Location
Room 315, ICRA 2024, Yokohama, Japan
FAQ
Are you going to record the talks and post them later on YouTube?
We will post on YouTube the talks of those speakers who give us permission. We will NOT post recordings of the panel discussion, the debate, or the open discussion at the end.
Can I present remotely if my paper is accepted as a poster or a spotlight talk?
We will play a pre-recording of your spotlight talk, and we strongly encourage you to find friends who can present your poster in person.
Call for Papers
We invite submissions including but not limited to the following topics:
- Applications:
  - Integration of VLMs/LLMs for manipulation and navigation
  - VLMs/LLMs for perception/scene understanding/state estimation
  - VLMs/LLMs for control/skill learning/motion generation
  - VLMs/LLMs for decision-making/reasoning/planning
  - VLMs/LLMs as world models
  - VLMs/LLMs for multimodal task specifications
  - VLMs/LLMs for human-robot/robot-robot interactions
  - VLMs/LLMs for scene and task generation
- New Capabilities:
  - Open-vocabulary perception/navigation/manipulation
  - Commonsense reasoning with VLMs/LLMs
  - Generalization to unseen object categories, environments, and tasks
  - Bootstrapping learning from scarce data
  - Natural language interaction with everyday users
- Datasets/Benchmarks:
  - Internet-scale data for training robotics foundation models
  - Mobile manipulation benchmarks for VLM/LLM-based systems
- Limitations:
  - Failure modes of VLMs/LLMs
  - Robustness of VLMs/LLMs
  - Certifiability of VLMs/LLMs
Important Dates:
- Submission portal opens: January 29, 2024
- Paper submission deadline: March 11 (Monday), 2024 (AoE)
- Notification of acceptance: March 29, 2024 (results viewable on OpenReview); April 1, 2024 (spotlights announced)
- Camera-ready deadline: April 26, 2024
- Workshop @ ICRA 2024: May 17, 2024
Organizers
- Chris Paxton (FAIR, Meta)
- Fei Xia (Google DeepMind)
- Karmesh Yadav (Georgia Tech)
- Nur Muhammad Mahi Shafiullah (New York University)
- Naoki Wake (Microsoft Research)
- Weiyu Liu (Stanford University)
- Yujin Tang (Sakana AI)
- Zhutian Yang (MIT, NVIDIA Research)





















