Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V
*Indicates Equal Contribution
ICRA 2025
Abstract
Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
Approach
A brief overview of COME-robot's workflow. Given a task instruction, COME-robot employs GPT-4V for reasoning and generates a code-based plan. Using feedback obtained from the robot's execution and its interaction with the environment, it iteratively updates the subsequent plan or recovers from failures, ultimately accomplishing the given task.
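The plan, execute, and replan cycle described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `Feedback`, `run_closed_loop`, `toy_planner`, and `make_flaky_executor` are hypothetical names, the planner is a rule-based stand-in for GPT-4V, and the step strings only mimic a code-based plan. It is not COME-robot's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    """Result of executing one step (hypothetical structure)."""
    success: bool
    message: str


History = List[Tuple[str, Feedback]]


def run_closed_loop(
    instruction: str,
    planner: Callable[[str, History], List[str]],
    executor: Callable[[str], Feedback],
    max_rounds: int = 5,
) -> Tuple[bool, History]:
    """Alternate planning and execution until the planner has nothing left."""
    history: History = []
    for _ in range(max_rounds):
        plan = planner(instruction, history)
        if not plan:
            return True, history  # planner reports the task as complete
        for step in plan:
            feedback = executor(step)
            history.append((step, feedback))
            if not feedback.success:
                break  # stop executing and replan with the failure in history
    return False, history


def toy_planner(instruction: str, history: History) -> List[str]:
    """Stand-in for the GPT-4V planner: return the steps not yet completed."""
    done = {step for step, fb in history if fb.success}
    steps = ["navigate_to('table')", "grasp('cup')", "place('cup', 'sink')"]
    return [s for s in steps if s not in done]


def make_flaky_executor() -> Callable[[str], Feedback]:
    """Stand-in for the robot: the first grasp attempt fails, later ones succeed."""
    attempts = {}

    def executor(step: str) -> Feedback:
        attempts[step] = attempts.get(step, 0) + 1
        if step.startswith("grasp") and attempts[step] == 1:
            return Feedback(False, "gripper closed on nothing")
        return Feedback(True, "ok")

    return executor


ok, history = run_closed_loop("put the cup in the sink", toy_planner, make_flaky_executor())
for step, fb in history:
    print(step, "->", fb.message)
```

In this toy run the failed grasp ends the first round early, and the second round retries it using the recorded feedback. A real system would obtain that feedback from perception instead of a scripted stub.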
COME-robot's planner has two key designs:
Open-Vocabulary Perception and Reasoning and
Closed-Loop Feedback and Restoration.
The former helps the robot ground open-ended instructions in real environments, and the latter supports robust task completion through failure recovery.
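As a rough illustration of the feedback and restoration design, the sketch below attributes a failed step to one module and picks a matching recovery action. It is a simplified, rule-based stand-in: in COME-robot this reasoning is performed by GPT-4V, and the module names, the `Observation` fields, and the restoration strings here are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Module(Enum):
    """Hypothetical modules a failure can be traced to."""
    PERCEPTION = "perception"
    NAVIGATION = "navigation"
    MANIPULATION = "manipulation"


@dataclass
class Observation:
    """Hypothetical checks made after a step is executed."""
    target_detected: bool
    target_reachable: bool
    object_in_gripper: bool


# Assumed recovery action per module, for illustration only.
RESTORATION = {
    Module.PERCEPTION: "re-explore the scene and re-detect the target",
    Module.NAVIGATION: "move to a new base pose near the target",
    Module.MANIPULATION: "re-estimate the grasp pose and retry the grasp",
}


def trace_failure(obs: Observation) -> Optional[Module]:
    """Return the earliest module whose check failed, or None on success."""
    if not obs.target_detected:
        return Module.PERCEPTION
    if not obs.target_reachable:
        return Module.NAVIGATION
    if not obs.object_in_gripper:
        return Module.MANIPULATION
    return None


def restoration_for(obs: Observation) -> Optional[str]:
    """Map an observation to a recovery action, or None if nothing failed."""
    module = trace_failure(obs)
    return None if module is None else RESTORATION[module]


missed_grasp = Observation(target_detected=True, target_reachable=True, object_in_gripper=False)
print(trace_failure(missed_grasp), "->", restoration_for(missed_grasp))
```

Checking the modules in pipeline order means an upstream failure, such as a missed detection, is blamed before a downstream symptom, such as an empty gripper.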
Actions to be executed, as reasoned by GPT-4V, are highlighted in blue; identified failures are highlighted in red; and analyses after observation or verification are highlighted in green.
Results
Legged Manipulation
Mobile Manipulation
Tabletop Manipulation
Cases of Recovery from Failures
System Prompts
BibTeX
@misc{zhi2025closedloopopenvocabularymobilemanipulation,
title={Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V},
author={Peiyuan Zhi and Zhiyuan Zhang and Yu Zhao and Muzhi Han and Zeyu Zhang and Zhitian Li and Ziyuan Jiao and Baoxiong Jia and Siyuan Huang},
year={2025},
eprint={2404.10220},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2404.10220},
}