STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation
Contributions
- We propose a framework that incrementally builds a structured representation of the environment, enabling the VLM to make more informed decisions.
- We design an efficient two-stage navigation policy based on this representation, combining high-level planning guided by the VLM's reasoning with low-level, VLM-assisted exploration.
- STRIVE achieves state-of-the-art performance on simulated benchmarks (HM3D, RoboTHOR, MP3D) and shows strong performance in diverse and complex real-world environments.
Video
Abstract
We evaluate our approach on three simulated benchmarks (HM3D, RoboTHOR, and MP3D) and achieve state-of-the-art performance in both success rate (↑ 7.1%) and navigation efficiency (↑ 12.5%). We further validate our method on a real robot platform, demonstrating strong robustness across 15 object navigation tasks in 10 different indoor environments.
Method Overview
Overview of STRIVE. We construct a multi-layer representation R on the fly, consisting of object, viewpoint, and room nodes, which serves as structured input for the VLM. Based on R, we introduce a two-stage navigation policy: the VLM reasons and plans at the room level, while the agent explores within each room at the viewpoint level using a VLM-assisted frontier-based navigation strategy and VLM-based target verification.
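The three node layers described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the STRIVE implementation: the class names (`ObjectNode`, `ViewpointNode`, `RoomNode`), their fields, and the `summarize` helper are hypothetical, and the text format passed to the VLM is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str                      # detected category, e.g. "nightstand"
    position: tuple[float, float]   # 2D map coordinates (assumed layout)


@dataclass
class ViewpointNode:
    position: tuple[float, float]
    objects: list[ObjectNode] = field(default_factory=list)
    explored: bool = False          # set once the agent has visited this viewpoint


@dataclass
class RoomNode:
    room_id: int
    viewpoints: list[ViewpointNode] = field(default_factory=list)

    def object_labels(self) -> list[str]:
        """Sorted, de-duplicated labels of objects seen from this room's viewpoints."""
        return sorted({o.label for v in self.viewpoints for o in v.objects})

    def fully_explored(self) -> bool:
        """A room with no viewpoints yet counts as unexplored."""
        return bool(self.viewpoints) and all(v.explored for v in self.viewpoints)


def summarize(rooms: list[RoomNode]) -> str:
    """Render the room layer as text that could be placed in a VLM prompt."""
    lines = []
    for room in rooms:
        status = "explored" if room.fully_explored() else "unexplored"
        labels = ", ".join(room.object_labels()) or "none"
        lines.append(f"Room {room.room_id} ({status}): {labels}")
    return "\n".join(lines)
```

Keeping objects under viewpoints and viewpoints under rooms means a room-level summary can be regenerated cheaply whenever a new viewpoint or detection is added, which is one plausible way to support incremental construction.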
Benchmark Results
Comparison with SOTA methods under different settings on the HM3D, RoboTHOR, and MP3D datasets. We report Success Rate (SR) and Success weighted by Path Length (SPL).
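For readers unfamiliar with the two metrics, the sketch below computes them using the standard definitions from the embodied-navigation literature: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator times the ratio of the shortest-path length to the larger of the actual and shortest path lengths. The function and argument names are illustrative, and the handling of zero-length episodes is an assumption.

```python
def success_rate(successes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for s in successes if s) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over all episodes.

    successes:        iterable of booleans (or 0/1), one per episode
    shortest_lengths: shortest-path distance from start to goal, per episode
    path_lengths:     length of the path the agent actually travelled, per episode
    """
    total = 0.0
    count = 0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        count += 1
        if not s:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # Assumption: an episode that starts at the goal (zero length) scores 1.0.
        total += shortest / denom if denom > 0 else 1.0
    return total / count


# Three episodes: an inefficient success, a failure, and an optimal success.
sr = success_rate([True, False, True])
score = spl([True, False, True], [5.0, 4.0, 10.0], [10.0, 8.0, 10.0])
```

In this example the first episode contributes 5/10, the second contributes 0, and the third contributes 10/10, so SPL is 0.5 while SR is 2/3. SPL therefore penalizes long detours even when the target is eventually found.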
Qualitative Results
Qualitative visualization of STRIVE. The first and second steps show the VLM's reasoning process, where it selects Rooms 6 and 9 by jointly considering room layout ('doorway'), semantic cues ('nightstand'), and travel cost (penalized distance). The final step shows VLM-based verification, which uses contextual cues (e.g., mattress, pillows) to confirm the target object as a 'bed'.
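To make the room-selection step more concrete, the sketch below shows one way candidate rooms, their cues, and a travel cost might be serialized into a text prompt for a VLM. Everything here is hypothetical: the `Candidate` fields, the linear revisit penalty in `penalized_distance`, the `revisit_penalty` value, and the prompt wording are assumptions made for illustration, and the actual penalty and prompt used by STRIVE are not specified on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    room_id: int
    cues: list[str] = field(default_factory=list)  # layout or semantic cues
    distance_m: float = 0.0                        # path distance from the agent
    visits: int = 0                                # times the room was already visited


def penalized_distance(c: Candidate, revisit_penalty: float = 2.0) -> float:
    """Hypothetical travel cost: distance inflated linearly by prior visits."""
    return c.distance_m * (1.0 + revisit_penalty * c.visits)


def build_prompt(target: str, candidates: list[Candidate]) -> str:
    """List candidates from cheapest to most expensive and ask for a choice."""
    lines = [f"Target object: {target}", "Candidate rooms:"]
    for c in sorted(candidates, key=penalized_distance):
        cues = ", ".join(c.cues) or "none"
        cost = penalized_distance(c)
        lines.append(f"- Room {c.room_id}: cues = {cues}; travel cost = {cost:.1f}")
    lines.append("Which room should be explored next? Answer with a room number.")
    return "\n".join(lines)


prompt = build_prompt(
    "bed",
    [
        Candidate(9, ["nightstand"], distance_m=2.0, visits=1),
        Candidate(6, ["doorway"], distance_m=3.0),
    ],
)
```

Here the cost is only presented to the VLM as one more piece of evidence; the final choice is left to the model, which matches the caption's description of cues and travel cost being considered jointly.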
Real-world Experiments
Experiments on HM3D
Experiments on RoboTHOR
Experiments on MP3D
BibTeX
@misc{zhu2025strivestructuredrepresentationintegrating,
title={STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation},
author={Haokun Zhu and Zongtai Li and Zhixuan Liu and Wenshan Wang and Ji Zhang and Jonathan Francis and Jean Oh},
year={2025},
eprint={2505.06729},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2505.06729},
}