
STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation

Haokun Zhu¹*, Zongtai Li¹*, Zhixuan Liu¹, Wenshan Wang¹, Ji Zhang¹, Jonathan Francis¹,², Jean Oh¹

¹Carnegie Mellon University, ²Bosch Center for AI

Paper | arXiv | Code (Coming Soon)

Contributions

We propose a framework that incrementally builds a structured representation of the environment, enabling the VLM to make more informed decisions.
We design an efficient two-stage navigation policy based on this representation, combining high-level planning guided by the VLM's reasoning with low-level VLM-assisted exploration.
STRIVE achieves state-of-the-art performance on simulated benchmarks (HM3D, RoboTHOR, MP3D) and shows strong performance in diverse and complex real-world environments.

Video

Abstract

We propose a novel framework that constructs a multi-layer representation of the environment during navigation. This representation consists of viewpoint nodes, object nodes, and room nodes. Viewpoint and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this representation, we propose a novel two-stage navigation policy that integrates high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently locate a goal object.

We evaluate our approach on three simulated benchmarks (HM3D, RoboTHOR, and MP3D) and achieve state-of-the-art performance in both success rate (↑ 7.1%) and navigation efficiency (↑ 12.5%). We further validate our method on a real robot platform, demonstrating strong robustness across 15 object navigation tasks in 10 different indoor environments.

Method Overview

Overview of STRIVE. We construct a multi-layer representation R on the fly, consisting of object, viewpoint, and room nodes, which serves as a structured input for the VLM. Based on R, we introduce a two-stage navigation policy in which the VLM reasons and plans at the room level, while the agent explores within a room at the viewpoint level using a VLM-assisted frontier-based navigation strategy and VLM-based target verification.
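As a rough illustration of what a three-layer representation like R could look like in code, the sketch below models room, viewpoint, and object nodes as small Python dataclasses and serializes one room into text that could be handed to a VLM. Everything here is an assumption made for illustration: the class names, fields, and the `room_summary` format are hypothetical and are not taken from the STRIVE implementation, whose code has not been released.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Position = Tuple[float, float]  # assumed 2D map coordinates


@dataclass
class ObjectNode:
    label: str          # detected category, e.g. "nightstand"
    position: Position
    viewpoint_id: int   # viewpoint from which the object was observed


@dataclass
class ViewpointNode:
    node_id: int
    position: Position
    room_id: int


@dataclass
class RoomNode:
    room_id: int
    viewpoint_ids: List[int] = field(default_factory=list)
    object_labels: List[str] = field(default_factory=list)


@dataclass
class Representation:
    """Hypothetical container for the three node layers."""
    rooms: Dict[int, RoomNode] = field(default_factory=dict)
    viewpoints: Dict[int, ViewpointNode] = field(default_factory=dict)
    objects: List[ObjectNode] = field(default_factory=list)

    def add_viewpoint(self, node_id: int, position: Position, room_id: int) -> None:
        # Create the room node lazily the first time a viewpoint falls in it.
        room = self.rooms.setdefault(room_id, RoomNode(room_id))
        self.viewpoints[node_id] = ViewpointNode(node_id, position, room_id)
        room.viewpoint_ids.append(node_id)

    def add_object(self, label: str, position: Position, viewpoint_id: int) -> None:
        # Attach the object to its viewpoint and, through it, to the room.
        viewpoint = self.viewpoints[viewpoint_id]
        self.objects.append(ObjectNode(label, position, viewpoint_id))
        self.rooms[viewpoint.room_id].object_labels.append(label)

    def room_summary(self, room_id: int) -> str:
        # Text form of one room node (assumed prompt format).
        room = self.rooms[room_id]
        labels = sorted(set(room.object_labels))
        return f"Room {room_id}: {len(room.viewpoint_ids)} viewpoints, objects={labels}"
```

A short usage example: after `rep = Representation()`, calling `rep.add_viewpoint(0, (0.0, 0.0), 6)` and `rep.add_object("nightstand", (1.0, 0.5), 0)` makes `rep.room_summary(6)` return a one-line description of Room 6 that lists the nightstand.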

Benchmark Results

Comparison with SOTA methods with different settings on HM3D, RoboTHOR, and MP3D datasets. We report the Success Rate (SR) and Success weighted by Path Length (SPL) metrics.
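For readers unfamiliar with the two metrics, the sketch below computes them using the standard definitions: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The function and argument names are illustrative assumptions, and this is not the evaluation code used for the reported results.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    agent_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        denominator = max(taken, shortest)
        # Failed episodes contribute 0; the guard skips degenerate zero-length episodes.
        if success and denominator > 0:
            total += shortest / denominator
    return total / len(successes)
```

For example, with three episodes where the first succeeds along the shortest path, the second succeeds with a path twice as long as the shortest one, and the third fails, SR is 2/3 and SPL is (1 + 0.5 + 0) / 3 = 0.5.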

Qualitative visualization of STRIVE. The first and second steps show the VLM's reasoning process, in which it selects Rooms 6 and 9 by jointly considering room layout ('doorway'), semantic cues ('nightstand'), and travel cost (penalized distance). The final step shows VLM-based verification, which uses contextual cues (e.g., mattress, pillows) to confirm the target object as a 'bed'.
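The select-then-verify sequence in this visualization can be sketched as a simple control loop: pick a room, explore its viewpoints, and verify any candidate detection before stopping. The code below is a minimal sketch under stated assumptions. The three callables stand in for VLM room selection, in-room exploration, and VLM verification; their names and signatures are hypothetical, and the frontier-based exploration strategy is reduced to visiting a queue of viewpoints in order.

```python
from typing import Callable, Dict, List, Optional


def two_stage_navigate(
    rooms: Dict[int, List[str]],
    choose_room: Callable[[Dict[int, List[str]], str], Optional[int]],
    explore_viewpoint: Callable[[str], List[str]],
    verify_target: Callable[[str, str], bool],
    goal: str,
    max_steps: int = 50,
) -> Optional[str]:
    """Return the viewpoint id where the goal was verified, or None.

    `rooms` maps a room id to its queue of unexplored viewpoint ids
    (an assumed layout); the queues are consumed in place.
    """
    steps = 0
    while steps < max_steps:
        # Stage 1: room-level planning over rooms that still have viewpoints.
        candidates = {rid: vps for rid, vps in rooms.items() if vps}
        if not candidates:
            return None
        room_id = choose_room(candidates, goal)
        if room_id is None or room_id not in candidates:
            return None
        # Stage 2: viewpoint-level exploration inside the chosen room.
        while rooms[room_id] and steps < max_steps:
            viewpoint = rooms[room_id].pop(0)
            steps += 1
            for label in explore_viewpoint(viewpoint):
                # Verification step: accept only confirmed detections.
                if verify_target(label, goal):
                    return viewpoint
    return None
```

With stub callables (for instance, a `choose_room` that always picks the lowest room id and a `verify_target` that compares labels for equality), the loop exhausts one room before asking for the next, which mirrors the room-by-room progression shown in the figure without claiming to reproduce the VLM's actual reasoning.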

Real-world Experiments

Experiments on HM3D

Experiments on RoboTHOR

Experiments on MP3D

BibTeX

@misc{zhu2025strivestructuredrepresentationintegrating,
      title={STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation}, 
      author={Haokun Zhu and Zongtai Li and Zhixuan Liu and Wenshan Wang and Ji Zhang and Jonathan Francis and Jean Oh},
      year={2025},
      eprint={2505.06729},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.06729}, 
    }
