Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, Stan Birchfield
ICRA 2025
We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization.
Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually-crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We systematically evaluate our method on simulation and real-world tasks. In real-world evaluation, using only eight demonstrations shot on an iPhone, our approach completed all tasks while fully complying with task constraints.
The codebase is thoroughly tested on a desktop running Ubuntu 22 with an RTX 4090 GPU.
Create conda environment
conda create -n spot python=3.8
conda activate spot
Install dependencies (for FoundationPose compilation)
# Install eigen library
conda install conda-forge::eigen=3.4.0
# Install gcc and cuda
conda install gcc_linux-64 gxx_linux-64
conda install cuda -c nvidia/label/cuda-12.1.0
conda install nvidia/label/cuda-12.1.0::cuda-cudart
conda install cmake
# Install boost library
sudo apt install libboost-all-dev
conda install conda-forge::boost
Install PyTorch and PyTorch3D
conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install pytorch3d -c pytorch3d
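A quick optional check (not part of the original instructions) to confirm that the CUDA-enabled PyTorch build and PyTorch3D were picked up:
# Verify PyTorch sees the GPU and PyTorch3D imports cleanly
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import pytorch3d; print(pytorch3d.__version__)"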
Install CoppeliaSim v4.1.0 (see here for details)
# set env variables
export COPPELIASIM_ROOT=${HOME}/CoppeliaSim
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT
wget https://downloads.coppeliarobotics.com/V4_1_0/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
mkdir -p $COPPELIASIM_ROOT && tar -xf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz -C $COPPELIASIM_ROOT --strip-components 1
rm -rf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
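PyRep and RLBench locate CoppeliaSim through the variables exported above, so they must be set in every shell you use. One option (an assumption about your setup, not a required step) is to append them to ~/.bashrc:
# Persist the CoppeliaSim environment variables for future shells
{
  echo 'export COPPELIASIM_ROOT=${HOME}/CoppeliaSim'
  echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT'
  echo 'export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT'
} >> ~/.bashrc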
Install PyRep, YARR, and RLBench (PerAct's branch)
git clone https://github.com/MohitShridhar/PyRep.git
cd PyRep
pip3 install -r requirements.txt
pip3 install .
cd ..
git clone https://github.com/stepjam/YARR.git
cd YARR
pip3 install -r requirements.txt
pip3 install .
cd ..
git clone https://github.com/MohitShridhar/RLBench.git -b peract
cd RLBench
pip3 install -r requirements.txt
pip3 install .
cd ..
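Optionally, confirm that the three packages import cleanly. Run this in a shell where the CoppeliaSim variables above are set, since PyRep needs them:
# Sanity check: all three simulator packages should import without errors
python -c "import pyrep, yarr, rlbench; print('PyRep, YARR, and RLBench imported OK')"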
Install Python dependencies
pip install -r fp_requirements.txt
python -m pip install --quiet --no-cache-dir git+https://github.com/NVlabs/nvdiffrast.git
pip install -r dp3_requirements.txt
Compile FoundationPose's extensions
cd foundation_pose
CMAKE_PREFIX_PATH=$CONDA_PREFIX/lib/python3.8/site-packages/pybind11/share/cmake/pybind11 bash build_all_conda.sh
cd ..
Download the model weights from here or from the FoundationPose repo, and put them under `data/model_weight/foundation_pose`.
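For reference, a minimal sketch of putting the weights in place; the source path is a placeholder for wherever you downloaded them, and the file names inside depend on the FoundationPose release:
# Create the expected directory and move the downloaded weights into it (placeholder source path)
mkdir -p data/model_weight/foundation_pose
mv ~/Downloads/foundation_pose_weights/* data/model_weight/foundation_pose/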
Our dataset generation is based on PerAct's pre-generated datasets. We replay the demonstrations to collect object pose information for policy training.
Download PerAct's pre-generated datasets for train (100 episodes), validation (25 episodes), and test (25 episodes) splits (check PerAct's repo for details). The task list can be found in our paper.
For reference, I stored the dataset as:
[PERACT_DATASET_PATH]
└─ raw
└─ train
└─ [TASK_1]
└─ [TASK_2]
└─ ...
└─ val
└─ [TASK_1]
└─ [TASK_2]
└─ ...
└─ test
└─ [TASK_1]
└─ [TASK_2]
└─ ...
Download RLBench's object meshes or manually export them from CoppeliaSim.
Set up the arguments in scripts/gen_demonstration_rlbench.sh.
`--peract_demo_dir` specifies the path where PerAct's demos are stored, e.g., `[PERACT_DATASET_PATH]/raw`. `--save_path` specifies the path to store the generated dataset.
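For example, with the layout above, the two values might look like the following (illustrative placeholders written as shell variables; substitute your own paths when editing the script):
# Example values only -- adjust to your machine
PERACT_DEMO_DIR=${HOME}/data/peract/raw        # passed as --peract_demo_dir
SAVE_PATH=${HOME}/data/spot/rlbench_demos      # passed as --save_path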
Run the script to collect demonstrations from RLBench for all tasks.
bash scripts/gen_demonstration_rlbench.sh
- Set `dataset.root_dir` in `config/task/rlbench_multi.yaml` to the path of the generated demonstrations, i.e., `--save_path` in the script `scripts/gen_demonstration_rlbench.sh` (a quick check is shown below).
- (Optional) Modify `self.task_list` in `diffusion_policy_3d/dataset/rlbench_dataset_list.py` if you want to select your own task suite.
- (Optional) For single-task training, set `dataset.root_dir` in `config/task/rlbench/[TASK_NAME].yaml` instead.
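A quick optional check that the path took effect (this assumes the config file lives at the path given above):
# Print the dataset root configured for multi-task training
grep -n "root_dir" config/task/rlbench_multi.yaml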
Run the script for training:
# Train on all tasks
bash scripts/train_policy.sh rlbench_multi
# Train on single task
bash scripts/train_policy.sh rlbench/[TASK_NAME]
- Set `pose_estimation.mesh_dir` in `config/simple_dp3.yaml`, ensuring the path leads to the downloaded RLBench mesh files.
- (Optional) Modify the task list in `scripts/eval_policy_multi.sh` if you want to select your own task suite.
- (Optional) Set `env_runner.root_dir` in `config/task/rlbench/[TASK_NAME].yaml` to the path of the generated demonstrations, i.e., `--save_path` in the script `scripts/gen_demonstration_rlbench.sh`.
Run the script for evaluation:
# Evaluate on all tasks
bash scripts/eval_policy_multi.sh
# Evaluate on single task
bash scripts/eval_policy.sh rlbench/[TASK_NAME]
- Note: The paper's results are based on an internal version of FoundationPose that cannot be publicly released due to legal restrictions. Instead, we reference the public version of FoundationPose. Our testing showed no performance degradation on the RLBench benchmark (see here for more information).
In this section, I describe my workflow for real-world experiments. This should serve only as a reference, and I recommend that readers use any tools they are familiar with. I used only one iPhone 12 Pro for the entire data collection process.
- This guide assumes that the conda environment `spot` has been configured according to the instructions in Installation.
- (Optional) Set up another conda environment named `yolo_world` following YOLO-World Installation.
  - YOLO-World is used to obtain object masks during training (see the script `env_real/data/prepare_mask.py`) and deployment. If you use a different object detection/segmentation model, you can ignore this step.
For policy training and deployment, we need the following:
- Object mesh for pose tracking
- Task demonstration video for policy training
For each task, the object mesh is the reconstructed mesh of the graspable object (e.g., pitcher) and the target object (e.g., cup). The task demonstration is an RGBD video in which a human hand performs the task (e.g., pouring water).
- Object mesh scanning
  Use AR Code to scan both graspable and target objects. Export the mesh in `.usdz` format. Uncompress the `.usdz` file to obtain the `.usdc` mesh file, then convert the `.usdc` file to `.obj` format. I personally use Blender for this conversion (see the sketch after this list).
- Task demonstration collection
  Use Record3D to shoot a demonstration video of a single human hand performing the task. Use the export option "EXR + JPG sequence" to get the `.r3d` file.
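The following is a minimal conversion sketch, assuming Blender 3.3 or newer is on your PATH and using placeholder file names; a `.usdz` file is simply a zip archive, so it can be unpacked with `unzip`:
# Unpack the AR Code scan; the archive contains a .usdc mesh (exact name may differ)
unzip pitcher.usdz -d pitcher_usdz
# Convert .usdc to .obj with Blender in the background (start from an empty scene, import, export)
blender --background --python-expr "
import bpy
bpy.ops.wm.read_homefile(use_empty=True)
bpy.ops.wm.usd_import(filepath='pitcher_usdz/pitcher.usdc')
bpy.ops.wm.obj_export(filepath='pitcher.obj')
"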
After completing steps 1 and 2, place the mesh files in the `mesh` directory and the `.r3d` files in the `r3d` directory. The files should be stored as follows:
[TASK_DATASET_PATH]
└─ mesh
└─ pitcher
└─ pitcher.obj
└─ cup
└─ cup.obj
└─ r3d
└─ 2024-09-07--01-11-23.r3d
└─ 2024-09-07--01-11-41.r3d
└─ ...
Set up the task name and object names using `real_task_object_dict` in `env_real/utils/realworld_objects.py`.
- The `grasp_object_name` and `target_object_name` should be consistent with the folder names under `[TASK_DATASET_PATH]/mesh`.
- The `grasp_object_prompt` and `target_object_prompt` are the prompts for the object detection/segmentation model (in this case, YOLO-World) to obtain the object bounding box/mask for tracking.
Run the script for dataset generation:
bash scripts/gen_demonstration_real.sh
The generated dataset will be saved under [TASK_DATASET_PATH]/zarr.
To train the policy, set dataset.root_dir to [TASK_DATASET_PATH] in the config file (see config/task/_real_world_task_template.yaml for details).
Run the script for training:
bash scripts/train_policy.sh [TASK_NAME]
`ModuleNotFoundError: No module named 'rlbench.action_modes'`
- Edit the RLBench library's `setup.py` and add `'rlbench.action_modes'` to its packages list. See here for more details.
The policy learning is based on 3D Diffusion Policy. The RLBench data collection and evaluation are based on RLBench and PerAct. The object pose tracking is based on FoundationPose. We thank the authors for their wonderful work.
The code and data are released under the NVIDIA Source Code License. Copyright © 2025, NVIDIA Corporation. All rights reserved.