Example results on the HOI4D dataset.

Example results on ManiSkill.

Example results on the DROID dataset.
git clone https://github.com/A-embodied/A0.git
cd A0
conda create -n a0env python=3.10.0
conda activate a0env
# Install pytorch
# See https://pytorch.org/get-started/previous-versions/ for the command matching your CUDA version
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
# Install flash-attn
pip install flash-attn --no-build-isolation
# or install prebuilt flash-attn wheels for faster setup: https://github.com/mjun0812/flash-attention-prebuild-wheels
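# Optional sanity check (a suggestion, not part of the official setup): confirm that
# torch sees CUDA and that flash-attn imports cleanly before continuing
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available(), flash_attn.__version__)"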
# Install other prerequisites
pip install -r requirements.txt

Link the downloaded encoders (Qwen2.5-7B and siglip-so400m-patch14-384) to the repo directory:
# Under the root directory of this repo
mkdir -p google
mkdir -p Qwen
# Link the downloaded encoders to this repo
ln -s /path/to/Qwen2.5-7B Qwen/Qwen2.5-7B
ln -s /path/to/siglip-so400m-patch14-384 google/siglip-so400m-patch14-384

Download the A0-Dataset from Hugging Face 🤗 and unzip the zip files. Your dataset directory should look like:
├── maniskill          # maniskill_path
├── droid-cotrack      # droid_cotrack_path
├── droid_molmo_sam2   # droid_molmo_sam2_path
├── hoi4d_metadata     # hoi4d_metadata_path
├── hoi4d_frame        # hoi4d_frame_selection_path
└── HOI4D_release      # hoi4d_rgb_path
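Optionally, you can sanity-check the unzipped layout before editing the config. This snippet is only a suggestion (not part of the repo); DATA_ROOT is a placeholder for wherever you unzipped the dataset:

DATA_ROOT=/path/to/A0-Dataset
for d in maniskill droid-cotrack droid_molmo_sam2 hoi4d_metadata hoi4d_frame HOI4D_release; do
  [ -d "$DATA_ROOT/$d" ] && echo "ok: $d" || echo "missing: $d"
done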
Then set the dataset paths in configs/base.yaml:
# ...
dataset:
  droid_cotrack_path: /path/to/droid_cotrack
  droid_molmo_sam2_path: /path/to/droid_molmo_sam2
  hoi4d_metadata_path: /path/to/hoi4d_metadata
  hoi4d_rgb_path: /path/to/HOI4D_release
  hoi4d_frame_selection_path: /path/to/hoi4d_frame
  maniskill_path: /path/to/maniskill

Decompose the videos of the HOI4D_release dataset into images using ffmpeg via the official Python script decode.py:
python utils/decode.py

First, set a few variables in train.sh:
Run ifconfig to find your network interface, then export NCCL_SOCKET_IFNAME=<iface>.
Run ibstat to identify your InfiniBand device, then export NCCL_IB_HCA=<device:port>.
Set OUTPUT_DIR and CUDA_VISIBLE_DEVICES (see the example below).
Optionally, you can download the model pre-trained on the 1M PixMo-Points dataset: 🤗A0-1B-pretrain. Set --pretrained_model_name_or_path to load it as the initial parameters.
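For example, the variable block at the top of train.sh might look like the following; the interface and device names, GPU ids, and paths are placeholders and will differ on your machine:

export NCCL_SOCKET_IFNAME=eth0     # network interface reported by ifconfig
export NCCL_IB_HCA=mlx5_0:1        # InfiniBand device:port reported by ibstat
export CUDA_VISIBLE_DEVICES=0,1
OUTPUT_DIR=./checkpoints/a0_run    # placeholder output directory
# optional: pass the pre-trained checkpoint to the training command inside train.sh
# --pretrained_model_name_or_path /path/to/A0-1B-pretrain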
source train.sh

- The default model configuration (hidden size: 2048, depth: 28) contains 1 billion parameters. By setting hidden_size to 1024 and depth to 14 in configs/base.yaml, you can obtain a model with approximately 170 million parameters.
- In our experiments, we used 2 GPU cards with a batch size of 100 and trained the model for 30,000 steps. The 170M model required 46 GB of memory per card; in comparison, the 1B model required 73 GB per card.
You can test using your own trained model or the pre-trained models (🤗A0-1B and A0-170M).
Set the variable PRETRAINED_MODEL_NAME_OR_PATH in test_dataset.sh, for example as shown below.
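A minimal sketch of this step, assuming you use the Hugging Face CLI; the hub repository id below is a placeholder for the checkpoint linked above, and the local path is arbitrary:

huggingface-cli download <hf-org>/A0-1B --local-dir ./checkpoints/A0-1B   # placeholder repo id
# then, inside test_dataset.sh:
PRETRAINED_MODEL_NAME_OR_PATH=./checkpoints/A0-1B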
# test performance on Maniskill dataset
bash test_dataset.sh maniskill
# test performance on HOI4D Frame Selection dataset
bash test_dataset.sh hoi4d_frame
# test performance on HOI4D dataset
bash test_dataset.sh hoi4d
# test performance on DROID dataset
bash test_dataset.sh droid

You can run inference with your own trained model or the pre-trained models (🤗A0-1B and A0-170M).
# set the keyword arguments --pretrained_model_name_or_path, --instruction and --image_path
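# For example (hypothetical values; check inference.sh for how these arguments are consumed):
# --pretrained_model_name_or_path ./checkpoints/A0-1B
# --instruction "open the drawer"
# --image_path ./assets/example.png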
bash inference.sh

If you find this work useful, please cite:

@article{xu2025a0,
  title={A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation},
  author={Rongtao Xu and Jian Zhang and Minghao Guo and Youpeng Wen and Haoting Yang and Min Lin and Jianzheng Huang and Zhe Li and Kaidong Zhang and Liqiong Wang and Yuxuan Kuang and Meng Cao and Feng Zheng and Xiaodan Liang},
  journal={arXiv preprint arXiv:2504.12636},
  year={2025}
}