DynaMem
Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muhammad "Mahi" Shafiullah†, Lerrel Pinto†
†: Equal advising
Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits their applicability in real-world scenarios where environments frequently change due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot's environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, which is more than a 2x improvement over state-of-the-art static systems.
Videos
DynaMem in action
Here are sample trials from three lab environments and two home environments.
Method
Illustration of DynaMem
We maintain a feature point cloud as the robot's memory. When the robot receives a new RGB-D observation of the environment, it adds newly observed points to the memory and removes points that no longer exist.
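A minimal Python sketch of this update rule is shown below, assuming a voxelized feature point cloud; the class and parameter names (`DynamicPointCloudMemory`, `voxel_size`, `margin`, and so on) are illustrative stand-ins, not taken from the released code. Points are added per voxel with the most recent feature, and a voxel is removed when it projects inside the current camera frustum but lies clearly in front of the observed depth, meaning the ray now passes through that voxel unobstructed.

```python
import numpy as np

class DynamicPointCloudMemory:
    """Illustrative sketch of a voxelized dynamic feature point cloud."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        # voxel index (3-tuple) -> {"feature": ..., "last_image_id": ...}
        self.voxels = {}

    def _voxelize(self, points):
        return np.floor(points / self.voxel_size).astype(int)

    def add_observation(self, points_world, features, image_id):
        """Insert newly observed points, keeping the latest feature per voxel."""
        for idx, feat in zip(map(tuple, self._voxelize(points_world)), features):
            self.voxels[idx] = {"feature": feat, "last_image_id": image_id}

    def remove_stale(self, depth, intrinsics, world_to_cam, margin=0.1):
        """Drop voxels that project into the current frustum but lie clearly
        in front of the observed depth, i.e. space now seen to be empty."""
        fx, fy, cx, cy = intrinsics
        h, w = depth.shape
        for idx in list(self.voxels):
            center = (np.array(idx) + 0.5) * self.voxel_size
            p_cam = world_to_cam[:3, :3] @ center + world_to_cam[:3, 3]
            z = p_cam[2]
            if z <= 0:
                continue  # behind the camera
            u = int(fx * p_cam[0] / z + cx)
            v = int(fy * p_cam[1] / z + cy)
            if 0 <= u < w and 0 <= v < h and z < depth[v, u] - margin:
                del self.voxels[idx]  # the ray passed through this voxel
```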
To ground the object of interest described by a text query, the robot locates the memory point most similar to the query, along with the last image in which that point was observed. If the text is grounded in that image, or the point's feature has high similarity with the text, the point is taken as the location of the object of interest.
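This query step can be sketched as a cosine-similarity search over the stored features, followed by verification against the image where the best point was last seen. The snippet continues the memory sketch above; `sim_threshold` and `mllm_grounds` (a multimodal-LLM grounding check) are hypothetical stand-ins for the paper's two verification options, not its exact interface or values.

```python
import numpy as np

def localize(memory, text_query, text_feature, images,
             sim_threshold=0.28, mllm_grounds=None):
    """Find the memory voxel best matching the text, then verify it against
    the last image in which that voxel was observed. `memory` is the
    DynamicPointCloudMemory sketched above."""
    best_idx, best_sim = None, -1.0
    for idx, entry in memory.voxels.items():
        f = np.asarray(entry["feature"], dtype=float)
        sim = float(f @ text_feature /
                    (np.linalg.norm(f) * np.linalg.norm(text_feature) + 1e-8))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is None:
        return None  # memory is empty
    last_image = images[memory.voxels[best_idx]["last_image_id"]]
    grounded = best_sim > sim_threshold or (
        mllm_grounds is not None and mllm_grounds(last_image, text_query))
    if grounded:
        # Voxel center in world coordinates: the object's estimated location.
        return (np.array(best_idx) + 0.5) * memory.voxel_size
    return None  # not grounded; fall back to exploration
```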
If the text is grounded in the environment, the robot navigates to the target object; otherwise, the memory is projected into a 2D value map and the robot explores the environment guided by that map.
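A rough sketch of the value-map projection, under the same assumptions as above: each 2D cell takes the maximum text similarity of the voxels above it, and cells that were never observed receive an exploration bonus. The grid parameters, `origin` (world xy of the map corner), and the weight `w_explore` are illustrative, not values from the paper.

```python
import numpy as np

def build_value_map(memory, text_feature, origin, grid_res=0.1,
                    grid_shape=(200, 200), explored=None, w_explore=0.5):
    """Project the 3D memory into a 2D value map and pick a navigation goal."""
    semantic = np.zeros(grid_shape)
    occupied = np.zeros(grid_shape, dtype=bool)
    for idx, entry in memory.voxels.items():
        center = (np.array(idx) + 0.5) * memory.voxel_size
        gx = int((center[0] - origin[0]) / grid_res)
        gy = int((center[1] - origin[1]) / grid_res)
        if not (0 <= gx < grid_shape[0] and 0 <= gy < grid_shape[1]):
            continue  # outside the map extent
        f = np.asarray(entry["feature"], dtype=float)
        sim = float(f @ text_feature /
                    (np.linalg.norm(f) * np.linalg.norm(text_feature) + 1e-8))
        semantic[gx, gy] = max(semantic[gx, gy], sim)
        occupied[gx, gy] = True
    if explored is None:
        explored = occupied  # crude stand-in for the truly observed region
    value = semantic + w_explore * (~explored)  # similarity + exploration bonus
    goal = np.unravel_index(np.argmax(value), value.shape)
    return value, goal  # navigate toward the highest-value cell
```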
Evaluation
Performance of DynaMem
We evaluate DynaMem in 3 different environments, with 10 queries in each. As baselines, we select OK-Robot (with a prescanned static robot memory) and Gemini (used following the pipeline proposed in OpenEQA).
We find that both the VLM-feature and multimodal LLM (mLLM) variants of DynaMem achieve a total success rate of 70%. This is a significant improvement over the OK-Robot system, which has a total success rate of 30%. Notably, DynaMem is particularly adept at handling dynamic objects in the environment: only 6.7% of trials failed because our system could not navigate to such dynamic objects in the scene.
Paper
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
@article{liu2024dynamem,
title={DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation},
author={Liu, Peiqi and Guo, Zhanqiu and Warke, Mohit and Chintala, Soumith and Shafiullah, Nur Muhammad Mahi and Pinto, Lerrel},
journal={arXiv preprint arXiv:2411.04999},
year={2024}
}