Vision-Language Models for Navigation and Manipulation (VLMNM)
Full-day hybrid workshop at ICRA 2024, Room 315, Yokohama (Japan)
Friday, May 17, 9 am - 5 pm (JST)
Recordings of invited talks can be found on our YouTube channel.
For further information or questions, please contact vlm-navigation-manipulation-workshop [AT] googlegroups [DOT] com

With the rising capabilities of LLMs and VLMs, the past two years have seen a surge in research using VLMs for navigation and manipulation. By fusing visual interpretation with natural language processing, these models are poised to redefine how robotic systems interact with both their environment and human counterparts. The relevance of this topic cannot be overstated: as the frontier of human-robot interaction expands, so does the need for robots to comprehend and operate within complex environments using naturalistic instructions. Our workshop reflects state-of-the-art advances in this domain by featuring a diverse set of speakers: from senior academics to early-career researchers, from industry researchers to companies producing mobile manipulation platforms, and from researchers who are enthusiastic about using VLMs for robotics to those who have reservations about it. We aim for this event to be a catalyst for originality and diversity at ICRA 2024, and we believe that, amidst a sea of workshops, ours will provide unique perspectives that push the boundaries of what is achievable in robot navigation and manipulation.
In this workshop, we plan to discuss:
- How can VLMs/LLMs enhance robot navigation and manipulation?
- How to extract world knowledge from pre-trained VLMs/LLMs and apply it to navigation and manipulation?
- How to integrate VLMs/LLMs with robot components, such as perception, control, and planning? How to account for partial observability and uncertainty?
- Benchmarks and datasets to assess the generalization capabilities of VLMs/LLMs for navigation and manipulation.
- Capabilities and limitations of VLMs/LLMs for navigation and manipulation (e.g., task planning, spatial understanding).
- New interaction modes between robots and humans enabled by VLMs/LLMs.
Final Schedule
| Time (JST) | Event | Description | Time (PDT) (May 16) |
|---|---|---|---|
| 8:30 - 8:50 | Coffee and Pastries | Poster presenters set up posters | 16:30 - 16:50 |
| 8:50 - 9:00 | Introduction | | 16:50 - 17:00 |
| 9:10 - 9:35 | Invited talk: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks [Recording] | Prof. Subbarao Kambhampati, Arizona State University | 17:10 - 17:35 |
| 9:35 - 10:00 | Invited talk: LLM-based Task and Motion Planning for Robots [Recording] | Prof. Chuchu Fan, Massachusetts Institute of Technology | 17:35 - 18:00 |
| 10:00 - 10:20 | Invited talk: On the Challenges and Opportunities of Policy Learning for Mobile Manipulation [Recording] | Prof. Jeannette Bohg, Stanford University | 18:00 - 18:20 |
| 10:25 - 10:45 | Coffee Break and Poster Session | 20 min | 18:25 - 18:45 |
| 11:00 - 11:20 | Invited talk: LLM-State: Adaptive State Representation for Long-Horizon Task Planning in the Open World [Recording] | Prof. David Hsu, National University of Singapore | 19:00 - 19:20 |
| 11:20 - 11:40 | Invited talk: LLMs for System 1 Generalization [Recording] | Prof. Yuke Zhu, University of Texas at Austin | 19:20 - 19:40 |
| 11:40 - 12:00 | Panel: Bridging the Gap between Research & Industry | Moderator: Naoki Wake, Microsoft Research | 19:40 - 20:00 |
| 12:00 - 12:20 | | | 20:00 - 20:20 |
| 12:40 - 13:30 | Lunch Break | 50 min | 20:40 - 21:30 |
| 13:30 - 13:50 | Invited talk: Language as Bridge for Sim2Real [Recording] | Prof. Roberto Martín-Martín, University of Texas at Austin | 21:30 - 21:50 |
| 13:50 - 14:10 | Invited talk: Foundation Models of and for Navigation [Recording] | Dhruv Shah, University of California, Berkeley | 21:50 - 22:10 |
| 14:15 - 14:35 | Invited talk: BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [Recording] | Dr. Ruohan Zhang, Stanford University | 22:15 - 22:35 |
| 14:35 - 15:05 | Spotlight Talks (six) [Recording] | | 22:35 - 23:05 |
| 15:10 - 16:00 | Coffee Break and a Longer Poster Session | 1 hour | 23:10 - 24:00 |
| 16:00 - 16:40 | Debate: Are large foundation models the most important research topic in the next 5 years? (and various other questions) | Moderator: Nur Muhammad Mahi Shafiullah, New York University. *The organizers may or may not be serious about this special guest. | 00:00 - 00:40 |
| 16:40 - 16:55 | Moderated Open Discussion: What's Down the Horizon? / The 1 Billion Dollar Proposal | All in-person attendees | 00:40 - 00:55 |
| 16:55 - 17:00 | Best Paper Awards Ceremony and Closing Remarks | | 00:55 - 01:00 |
Location
Room 315, ICRA 2024, Yokohama, Japan
FAQ
Are you going to record the talks and post them later on YouTube?
We will post on YouTube the talks of those speakers who give us permission. We will NOT post recordings of the panel discussion, the debate, or the open discussion at the end.
Can I present remotely if my paper is accepted as a poster or a spotlight talk?
We will play a pre-recording of your spotlight talk, and we strongly encourage you to find friends who can present your poster in person.
Call for Papers
We invite submissions including but not limited to the following topics:
- Applications:
  - Integration of VLMs/LLMs for manipulation and navigation
  - VLMs/LLMs for perception/scene understanding/state estimation
  - VLMs/LLMs for control/skill learning/motion generation
  - VLMs/LLMs for decision-making/reasoning/planning
  - VLMs/LLMs as world models
  - VLMs/LLMs for multimodal task specifications
  - VLMs/LLMs for human-robot/robot-robot interactions
  - VLMs/LLMs for scene and task generation
- New Capabilities:
  - Open-vocabulary perception/navigation/manipulation
  - Commonsense reasoning with VLMs/LLMs
  - Generalization to unseen object categories, environments, and tasks
  - Bootstrapping learning from scarce data
  - Natural language interaction with everyday users
- Datasets/Benchmarks:
  - Internet-scale data for training robotics foundation models
  - Mobile manipulation benchmarks for VLM/LLM-based systems
- Limitations:
  - Failure modes of VLMs/LLMs
  - Robustness of VLMs/LLMs
  - Certifiability of VLMs/LLMs
Important Dates:
- Submission portal opens: January 29, 2024
- Paper submission deadline: March 11 (Monday), 2024 (AoE)
- Notification of acceptance: March 29, 2024 (results viewable on OpenReview); April 1, 2024 (spotlights announced)
- Camera-ready deadline: April 26, 2024
- Workshop @ ICRA 2024: May 17, 2024
Organizers
- Chris Paxton (FAIR, Meta)
- Fei Xia (Google DeepMind)
- Karmesh Yadav (Georgia Tech)
- Nur Muhammad Mahi Shafiullah (New York University)
- Naoki Wake (Microsoft Research)
- Weiyu Liu (Stanford University)
- Yujin Tang (Sakana AI)
- Zhutian Yang (MIT, NVIDIA Research)





















