OK-Robot
An open, modular framework for zero-shot, language-conditioned pick-and-drop tasks in arbitrary homes.
Peiqi Liu*, Yaswanth Orru*, Jay Vakil, Chris Paxton, Nur Muhammad "Mahi" Shafiullah†, Lerrel Pinto†
*: Equal contributions, †: Equal advising
Remarkable progress has been made in recent years in the fields of vision, language, and robotics. We now have vision models capable of recognizing objects based on language queries, navigation systems that can effectively control mobile platforms, and grasping models that can handle a wide range of objects. Despite these advancements, general-purpose applications of robotics still lag behind, even though they rely on these fundamental capabilities of recognition, navigation, and grasping. In this paper, we adopt a systems-first approach to develop a new Open Knowledge-based robotics framework called OK-Robot. By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training. To evaluate its performance, we run OK-Robot in 10 real-world home environments. The results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks, representing a new state of the art in Open Vocabulary Mobile Manipulation (OVMM) with nearly 1.8x the performance of prior work. In cleaner, uncluttered environments, OK-Robot's performance increases to 82%. However, the most important insight gained from OK-Robot is the critical role of nuanced details when combining Open Knowledge systems like VLMs with robotic modules.
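To make the modular structure concrete, the sketch below composes the three primitives the abstract names (a VLM-backed semantic memory, a navigation primitive, and a grasping primitive) into a single pick-and-drop routine. This is a minimal illustration under assumed interfaces, not the actual OK-Robot code: the class and method names (SemanticMemory-style locate, go_to, pick, release) are hypothetical, chosen only for exposition.

# Illustrative composition of the three open-knowledge modules into a
# pick-and-drop routine. All module interfaces here are assumptions for
# exposition, not the real OK-Robot API.

class OKRobotSketch:
    def __init__(self, semantic_memory, navigator, grasper):
        self.memory = semantic_memory  # VLM-backed map: language query -> 3D pose
        self.navigator = navigator     # navigation primitive: drive the base to a pose
        self.grasper = grasper         # grasping primitive: pick/release objects

    def pick_and_drop(self, object_query: str, destination_query: str) -> None:
        # 1. Query the VLM-based semantic memory for the target object's location.
        object_pose = self.memory.locate(object_query)
        # 2. Navigate to the object and grasp it.
        self.navigator.go_to(object_pose)
        self.grasper.pick(object_query)
        # 3. Locate the drop-off point, navigate there, and release.
        drop_pose = self.memory.locate(destination_query)
        self.navigator.go_to(drop_pose)
        self.grasper.release()

Because each module only communicates through language queries and poses, any of the three can be swapped for a stronger open-knowledge model without retraining the others, which is the point of the systems-first design described above.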
Videos
OK-Robot in action
In 10 home environments of New York City, OK-Robot attempted 171 pick-and-drop tasks. Here are sample trials from 5 homes, each showing 5 tasks.
Analysis
Understanding the performance of OK-Robot
While our method shows zero-shot generalization in completely new environments, we probe OK-Robot to better understand when and how it succeeds and fails. Although we find a 58.5% success rate in completely novel homes, on closer inspection we also notice a long tail of failure causes, which is presented in the figure above. The three leading causes of failure are failing to retrieve the right object to navigate to from the semantic memory (9.3%), getting a difficult pose from the manipulation module (8.0%), and hardware difficulties (7.5%).
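For a rough sense of scale, these rates can be converted into approximate trial counts over the 171 attempts reported above. The short sketch below does this rounding; the counts are back-of-the-envelope illustrations, not figures from the paper.

# Convert the reported rates into approximate trial counts out of 171
# attempts. Rounded, for illustration only.
total_trials = 171
rates = {
    "overall success": 0.585,
    "wrong object retrieved from semantic memory": 0.093,
    "difficult pose from manipulation module": 0.080,
    "hardware difficulties": 0.075,
}
for cause, rate in rates.items():
    print(f"{cause}: ~{round(rate * total_trials)} of {total_trials} trials")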
In the "Understanding the performance of OK-Robot" section of the paper, we analyze the failure modes presented in the figure above and discuss the most frequent cases.
Paper
OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics
@article{liu2024okrobot,
  title={OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics},
  author={Liu, Peiqi and Orru, Yaswanth and Vakil, Jay and Paxton, Chris and Shafiullah, Nur Muhammad Mahi and Pinto, Lerrel},
  journal={arXiv preprint arXiv:2401.12202},
  year={2024}
}