NaVILA: Legged Robot Vision-Language-Action Model for Navigation
RSS 2025
UC San Diego, USC, NVIDIA
*Equal Contribution, ordered alphabetically
† Equal Advising
TL;DR
NaVILA is a two-level framework that combines vision-language-action models (VLAs) with locomotion skills for navigation. The high-level VLA generates language-based commands, while a real-time low-level locomotion policy executes them and ensures obstacle avoidance.
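As a rough illustration of this two-level design, the sketch below shows a high-level VLA issuing language-based commands at a low rate while a low-level policy tracks them at control rate. It is a minimal sketch, not the released NaVILA code: the class names (`NaVILAPlanner`, `LocomotionPolicy`), the command vocabulary, and the one-second re-planning interval are illustrative assumptions.

```python
# Minimal sketch of a two-level navigation loop: a VLA emits language-based
# commands (e.g. "move forward 50 cm") at low frequency, and a locomotion
# policy tracks the resulting velocity target at control rate.
# All class/method names are hypothetical placeholders, not the NaVILA API.
import re
import time

class NaVILAPlanner:
    """Hypothetical wrapper around the VLA: image + instruction -> language command."""
    def next_command(self, image, instruction: str) -> str:
        # In the real system this would query the vision-language-action model.
        return "move forward 50 cm"

class LocomotionPolicy:
    """Hypothetical low-level policy that tracks a velocity target while avoiding obstacles."""
    def step(self, velocity_cmd, sensors) -> None:
        ...  # joint-level control at high frequency

def parse_command(cmd: str):
    """Translate a mid-level language command into a (vx, vy, yaw_rate) target."""
    if re.match(r"move forward \d+ cm", cmd):
        return (0.5, 0.0, 0.0)                       # walk forward at a fixed speed
    if m := re.match(r"turn (left|right) \d+ degrees", cmd):
        return (0.0, 0.0, 0.5 if m.group(1) == "left" else -0.5)
    return (0.0, 0.0, 0.0)                           # unknown command: stand still

def navigate(planner, policy, camera, sensors, instruction):
    while True:
        cmd = planner.next_command(camera.read(), instruction)   # low-frequency reasoning
        if cmd == "stop":
            return
        vel = parse_command(cmd)
        deadline = time.time() + 1.0                              # re-plan roughly every second
        while time.time() < deadline:
            policy.step(vel, sensors.read())                      # high-frequency control
```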
🔥 Highlights
- Strongly Generalizable VLA Model. We tamed a VLM into a VLA model and trained it on diverse datasets, including real-world human video data, simulated indoor navigation data, and question-answering tasks. NaVILA outperforms all methods that do not rely on simulator-pretrained waypoint predictors, even when those methods leverage additional sensors.
- Vision-Based Legged Robot Policy. We introduce an end-to-end vision-based locomotion policy trained without teacher-student distillation. By directly interacting with the environment using LiDAR during training, our approach significantly reduces the sim-to-real gap. Our policy ensures safety in challenging environments, such as near transparent surfaces, and excels at traversing rough terrain (see the interface sketch after this list).
- VLN-CE-Isaac Benchmark. We introduce a high-fidelity, physics-realistic benchmark for low-level robotic control. It provides a safe, scalable platform to evaluate navigation across diverse robots and scenarios, reducing real-world testing costs and risks.
- Real-World Deployment. NaVILA demonstrates strong performance in challenging real-world environments with quadruped and humanoid robots, showcasing its generalization capabilities and robustness.
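The interface sketch referenced in the second highlight: a hedged guess, in PyTorch, at what an end-to-end vision-based locomotion policy's inputs and outputs could look like. The observation contents, layer sizes, and the name `VisionLocomotionPolicy` are assumptions for illustration, not the architecture reported in the paper.

```python
# Hedged sketch of a vision-based locomotion policy's observation/action
# interface: proprioception, a velocity command, and a LiDAR-derived scan go
# in; joint position targets come out. Sizes and contents are illustrative
# assumptions, not the architecture used by NaVILA.
import torch
import torch.nn as nn

class VisionLocomotionPolicy(nn.Module):
    def __init__(self, num_proprio=48, num_scan=187, num_joints=12):
        super().__init__()
        # Compress the LiDAR/height scan into a compact feature vector.
        self.scan_encoder = nn.Sequential(
            nn.Linear(num_scan, 128), nn.ELU(),
            nn.Linear(128, 32), nn.ELU(),
        )
        # Fuse proprioception, the 3-D velocity command, and the scan feature.
        self.actor = nn.Sequential(
            nn.Linear(num_proprio + 3 + 32, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, num_joints),   # joint position targets
        )

    def forward(self, proprio, velocity_cmd, scan):
        z = self.scan_encoder(scan)
        return self.actor(torch.cat([proprio, velocity_cmd, z], dim=-1))

# Example forward pass with batch size 1.
policy = VisionLocomotionPolicy()
action = policy(torch.zeros(1, 48), torch.zeros(1, 3), torch.zeros(1, 187))
```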
Real-world Results: Booster T1
Real-world Results: Unitree Go2
Real-world Results: Unitree G1
Learning from YouTube Human Touring Videos
Results: VLN-CE-Isaac NaVILA-Go2-Vision
Results: VLN-CE-Isaac NaVILA-H1-Vision
Results: VLN-CE-Isaac Vision vs. Blind Policy. The vision-based policy achieves a higher Success Rate than the blind policy because it avoids obstacles more effectively.
Results: R2R-CE (Habitat)
Results: RxR-CE (Habitat)
Citation
@inproceedings{cheng2024navila,
  title     = {NaVILA: Legged Robot Vision-Language-Action Model for Navigation},
  author    = {Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Zou, Xueyan and Kautz, Jan and Biyik, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
  booktitle = {RSS},
  year      = {2025},
}
Acknowledgement
We sincerely thank Chengjing Yuan for their support with hardware setup and 3D modeling. We also thank Xuxin Cheng and Jialong Li for their help in setting up the G1 robot, as well as Jiazhao Zhang and Yukang Chen for their valuable discussions.