Explore until Confident: Efficient Exploration for Embodied Question Answering
- Allen Z. Ren
- Jaden Clark
- Anushri Dixit
- Masha Itkina
- Anirudha Majumdar
- Dorsa Sadigh
Combine VLM semantic reasoning and rigorous uncertainty quantification to enable agents to efficiently explore relevant regions of unknown 3D environments, and stop to answer questions about them with calibrated confidence.
Is my kid on the treadmill?
A) Yes
B) No
Is the lamp next to the sofa turned on?
A) Yes
B) No
How many bedside tables are there in the bedroom with the white bedding?
A) Three
B) None
C) One
D) Two
Which rug did I put next to the kitchen sink?
A) There is no rug
B) White one
C) Gray one
D) Green one
I am going to shower now. I need to grab some towels.
A) There are already some in the bathroom
B) There are none in the bathroom
C) There are some in the bedroom
D) There is only one in the bathroom
I remember leaving some books in one of the rooms, on wooden shelves.
A) In the room with orange wall
B) In the room with white wall
C) In the room with green wall
Where did I leave the striped towel?
A) On the living room floor
B) In the bathroom
C) By the kitchen sink
D) On the dining table
Simulated scenarios in Habitat-Sim
What kind of stools are under the white board?
A) White ones
B) Dark blue ones
C) Black ones
D) Lime green ones
Is there something here that I can cook my cookie dough in?
A) Yes
B) No
Is the dishwasher in the kitchen open or closed?
A) Closed
B) Open
Real-world scenarios with a Fetch mobile robot
Abstract
We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene to be able to plan how to explore over time, and their confidence can be miscalibrated, causing the robot to prematurely stop exploration or over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM, leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploration and leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show that our proposed approach improves performance and efficiency over baselines that do not leverage the VLM for exploration or do not calibrate its confidence.
Embodied Question Answering (EQA)
In EQA tasks, the robot starts at a random location in a 3D scene, explores the space, and stops when it is confident about answering the question. This is a challenging problem due to the high diversity of scenes and the lack of an a priori map of the environment. Previous works rely on training dedicated exploration policies and question-answering modules from scratch, which can be data-inefficient and can only handle simple questions.
1) Limited Internal Memory of VLMs. EQA benefits from the robot keeping track of previously explored regions, as well as regions not yet explored but relevant for answering the question. However, VLMs do not have an internal memory for mapping the scene and storing such semantic information;
2) Miscalibrated VLMs. VLMs are fine-tuned from pre-trained large language models (LLMs) that serve as the language decoder, and LLMs are often miscalibrated; that is, they can be over-confident or under-confident about their outputs. This makes it difficult to determine when the robot is confident enough in its answer to stop exploring.
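The abstract points to conformal prediction as the tool for calibrating this confidence. Purely as an illustration of how such calibration can turn raw answer likelihoods into a stopping rule, the minimal Python sketch below applies standard split conformal prediction to multiple-choice answers; the function names, the calibration data, and the nonconformity score are assumptions made for exposition, not the released implementation.

import numpy as np

def calibrate_threshold(cal_likelihoods, cal_labels, alpha=0.1):
    # cal_likelihoods: (N, K) array of VLM likelihoods over K answer options
    # for N held-out calibration questions; cal_labels: (N,) correct indices.
    # alpha is the target miscoverage rate (0.1 -> 90% coverage).
    n = len(cal_labels)
    # Nonconformity score: one minus the likelihood assigned to the true answer.
    scores = 1.0 - cal_likelihoods[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_set(likelihoods, qhat):
    # All answer options whose nonconformity score falls below the threshold.
    return [k for k, p in enumerate(likelihoods) if 1.0 - p <= qhat]

def should_stop(likelihoods, qhat):
    # Stop exploring once the calibrated prediction set narrows to one answer.
    return len(prediction_set(likelihoods, qhat)) == 1

Under this kind of scheme, the robot keeps exploring while the calibrated prediction set contains multiple answer options and stops as soon as it narrows to a single choice.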
How can we endow VLMs with the capability of efficient exploration for EQA?
To address the first challenge of limited internal memory, we build a map of the scene external to the VLM as the robot visits different locations. On top of this map, we embed the VLM's knowledge about possible exploration directions to guide the robot's exploration. Such semantic information is obtained through visual prompting: we annotate the free space in the current image view, prompt the VLM to choose among the unoccupied regions, and query its prediction. The resulting values are then stored in the semantic map.
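As a rough sketch of the data structure this implies, the snippet below maintains a 2D grid of VLM relevance scores outside the VLM and uses it to rank candidate frontier locations; the class layout, grid parameters, and the vlm_score_fn callable are illustrative assumptions rather than the paper's actual code.

import numpy as np

class SemanticValueMap:
    # 2D grid storing how relevant each location is for answering the question.
    def __init__(self, size=200, resolution=0.1):
        self.resolution = resolution          # meters per grid cell
        self.values = np.zeros((size, size))  # VLM relevance score per cell

    def world_to_cell(self, xy):
        # Convert a world-frame (x, y) position to a grid index (map centered at origin).
        i = int(xy[0] / self.resolution) + self.values.shape[0] // 2
        j = int(xy[1] / self.resolution) + self.values.shape[1] // 2
        return i, j

    def store(self, xy, score):
        # Keep the highest VLM score seen so far for this cell.
        cell = self.world_to_cell(xy)
        self.values[cell] = max(self.values[cell], score)

    def pick_frontier(self, frontier_xys):
        # Choose the candidate frontier with the highest stored relevance.
        return max(frontier_xys, key=lambda xy: self.values[self.world_to_cell(xy)])

def update_from_view(sem_map, candidate_regions, vlm_score_fn):
    # candidate_regions: free-space (x, y) points annotated in the current view.
    # vlm_score_fn: returns the VLM's relevance score for a labeled region
    # after visual prompting (stubbed here for illustration).
    for xy in candidate_regions:
        sem_map.store(xy, vlm_score_fn(xy))

# Example with a stubbed VLM score: store two annotated regions, then pick a goal.
sem_map = SemanticValueMap()
update_from_view(sem_map, [(1.0, 0.5), (2.5, -1.0)], vlm_score_fn=lambda xy: 0.7)
goal = sem_map.pick_frontier([(1.0, 0.5), (2.5, -1.0)])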
HM-EQA Dataset
While prior work has primarily considered synthetic scenes and simple questions such as "what is the color of the coffee table?" involving basic attributes of relatively large pieces of furniture, we are interested in applying our VLM-based framework in more realistic and diverse scenarios, where the question can be more open-ended and possibly require semantic reasoning. To this end, we propose HM-EQA, a new EQA dataset with 500 questions based on 267 scenes from the Habitat-Matterport 3D Research Dataset (HM3D). We consider five categories of questions:
Acknowledgements
We thank Donovon Jackson, Derick Seale, and Tony Nguyen for contributing to the HM-EQA dataset. The authors were partially supported by the Toyota Research Institute (TRI), the NSF CAREER Award [#2044149], and the Office of Naval Research [N00014-23-1-2148]. This article solely reflects the opinions and conclusions of its authors and not those of NSF, ONR, TRI, or any other Toyota entity. The website template is from KnowNo and Nerfies.