* Equal contribution
Abstract
We propose Hand-Eye Autonomous Delivery (HEAD), a framework that learns navigation, locomotion, and reaching skills for humanoids directly from human motion and visual perception data. We take a modular approach in which a high-level planner commands the target positions and orientations of the humanoid's hands and eyes, which are then delivered by a low-level policy that controls the whole-body movements. Specifically, the low-level whole-body controller learns to track the three points (eyes, left hand, and right hand) from existing large-scale human motion capture data, while the high-level policy learns from human data collected with Aria glasses. This modular design decouples egocentric visual perception from physical action, promoting efficient learning and scalability to novel scenes. We evaluate our method both in simulation and in the real world, demonstrating the humanoid's ability to navigate and reach in complex environments designed for humans.
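As a rough illustration of the modular interface described above (not the authors' implementation), the sketch below assumes a three-point target message passed from a high-level planner to a low-level whole-body controller; the class names, pose representation, and dimensions are all hypothetical.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class ThreePointTarget:
    """Target poses for the three tracked points: eyes, left hand, right hand.

    Each pose is a 3D position plus an orientation quaternion (w, x, y, z).
    This message format is assumed for illustration only.
    """
    eye_pos: np.ndarray          # (3,)
    eye_quat: np.ndarray         # (4,)
    left_hand_pos: np.ndarray    # (3,)
    left_hand_quat: np.ndarray   # (4,)
    right_hand_pos: np.ndarray   # (3,)
    right_hand_quat: np.ndarray  # (4,)


class HighLevelPlanner:
    """Hypothetical planner: maps an egocentric image to three-point targets."""

    def plan(self, ego_image: np.ndarray) -> ThreePointTarget:
        # Placeholder: a learned policy would predict these targets from the image.
        identity = np.array([1.0, 0.0, 0.0, 0.0])
        return ThreePointTarget(
            eye_pos=np.array([0.5, 0.0, 1.5]), eye_quat=identity,
            left_hand_pos=np.array([0.4, 0.2, 1.0]), left_hand_quat=identity,
            right_hand_pos=np.array([0.4, -0.2, 1.0]), right_hand_quat=identity,
        )


class WholeBodyController:
    """Hypothetical low-level policy: tracks the three points with joint commands."""

    def act(self, proprioception: np.ndarray, target: ThreePointTarget) -> np.ndarray:
        # Placeholder: a learned whole-body tracking policy would run here.
        return np.zeros(29)  # e.g. one command per actuated joint (assumed count)


if __name__ == "__main__":
    planner, controller = HighLevelPlanner(), WholeBodyController()
    ego_image = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy egocentric frame
    proprio = np.zeros(64)                               # dummy proprioceptive state
    target = planner.plan(ego_image)          # high-level: where to look and reach
    joint_cmd = controller.act(proprio, target)  # low-level: whole-body tracking
    print(joint_cmd.shape)
```

The point of the sketch is the decoupling: the planner only ever sees egocentric perception and emits hand/eye targets, while the controller only ever sees proprioception and targets, so each side can be trained from its own data source (Aria-glasses data and motion capture, respectively).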
Robot system setup
Collecting navigation data
Navigation in different environments
Autonomous policy rollout
Autonomous policy rollout
Autonomous policy rollout
Autonomous policy rollout
Failure mode
Citation
@article{chen2025hand,
  title={Hand-Eye Autonomous Delivery: Learning Humanoid Navigation, Locomotion and Reaching},
  author={Chen, Sirui and Ye, Yufei and Cao, Zi-Ang and Lew, Jennifer and Xu, Pei and Liu, C Karen},
  journal={arXiv preprint arXiv:2508.03068},
  year={2025}
}
Acknowledgements
This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project. It was adapted to be mobile responsive by Jason Zhang for PHOSA. The code can be found here.