An Embodied Generalist Agent in 3D World
ICML 2024
Yan Wang1, Qing Li1, Song-Chun Zhu1,2,4, Baoxiong Jia1, Siyuan Huang1
2Peking University 3Carnegie Mellon University 4Tsinghua University
✶ indicates equal contribution
Model
Scene representation. The scene point cloud is partitioned into object-centric point clouds (either ground truth or predicted proposals), which are then processed by the 3D encoder to obtain object-centric features. We also incorporate an optional 2D branch, where a 2D encoder processes the agent's ego-view observation to obtain ego-centric features.
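As a rough sketch of this featurization (not the authors' code: the function names, the random-projection "encoder", and all shapes are illustrative placeholders), the object-centric pipeline might look like:

```python
import numpy as np

def partition_scene(points: np.ndarray, labels: np.ndarray) -> list:
    """Split a scene point cloud (N, 6: xyz + rgb) into object-centric
    clouds using per-point instance labels (ground truth or predicted
    proposals)."""
    return [points[labels == i] for i in np.unique(labels)]

def encode_3d_object(obj_points: np.ndarray, dim: int = 256) -> np.ndarray:
    """Stand-in for a 3D encoder (a real point-cloud backbone in the paper):
    here a pooled random projection, just to show the interface."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((obj_points.shape[1], dim))
    return (obj_points @ proj).mean(axis=0)  # one feature vector per object

# Toy scene: 2 objects, 100 points each, xyz + rgb
points = np.random.default_rng(1).standard_normal((200, 6))
labels = np.repeat([0, 1], 100)
object_tokens = [encode_3d_object(p) for p in partition_scene(points, labels)]
print(len(object_tokens), object_tokens[0].shape)  # → 2 (256,)
```

The optional 2D branch would analogously map the ego-view image to a set of ego-centric feature tokens with a 2D encoder.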
Unified sequence and objective. The sequence begins with a system message that specifies the agent's role and situation. Subsequent 2D image tokens and 3D object tokens convey the perceived scene. Next, an instruction specifies the task or context and prompts for the final response. The learning objective is a simple auto-regressive (next-token prediction) loss.
Data
Two-stage scheme: alignment & instruction tuning. We combine existing datasets with LLM-prompted data to curate LEO-align and LEO-instruct, which drive the alignment and instruction-tuning stages, respectively.
Demo
LEO's responses are shown with blue shading.
BibTeX
@inproceedings{huang2024embodied,
title={An Embodied Generalist Agent in 3D World},
author={Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
year={2024}
}