[💻 Project page] [📄 Paper]
Puhao Li1,2, Yingying Wu1, Ziheng Xi1, Wanlin Li2, Yuzhe Huang2, Zhiyuan Zhang1, Yinghan Chen3, Jianan Wang4, Song-Chun Zhu1,2,3, Tengyu Liu2 ✉️, Siyuan Huang2 ✉️
1Tsinghua University, 2Beijing Institute for General Artificial Intelligence (BIGAI), 3Peking University, 4AstriBot.
ControlVLA is a general framework for few-shot, object-centric adaptation of pre-trained VLA models. It adapts a pre-trained VLA model to task- and environment-specific skills with only 10-20 expert demonstrations.
- Create a virtual environment through `conda` or another Python package manager.

  ```bash
  conda create -n controlvla python==3.9.18
  conda activate controlvla
  ```
- Install `torch` and the other dependent libraries.

  ```bash
  pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
  pip install -r requirements.txt

  ## install sam2 from the source code
  cd thirdparty
  git clone https://github.com/facebookresearch/sam2.git
  cd sam2
  git checkout 7e1596c0b6462eb1d1ba7e1492430fed95023598
  ## remove the python and pytorch version restrictions in the sam2 setup config
  pip install -e .
  ```

- The code is tested on `pytorch 2.1.0` and `cuda 12.1`; other versions may have compatibility issues.
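As a quick, optional sanity check of the install, you can confirm the torch build and that CUDA is visible; this is a minimal sketch, not part of the project's scripts.

```python
# Optional sanity check: verify the installed torch build and CUDA availability.
import torch

print(torch.__version__)          # expected: 2.1.0+cu121
print(torch.cuda.is_available())  # should print True on a machine with a CUDA 12.1-capable driver
```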
- Download the pre-trained models:
  - Pre-trained ControlVLA model: download it here, unzip it, and place it in the `./data` folder.
  - SAM2 model: follow the instructions here and place it in the `./data/checkpoints` folder. Note that the default config uses the checkpoint `sam2_hiera_tiny.pt`. You can simply download the default checkpoint with wget: `cd data/checkpoints && wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt`.
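A quick check after downloading can save debugging later. The sketch below only knows the default SAM2 checkpoint name from the instructions above; the ControlVLA files depend on the archive you unzipped, so it simply lists what ended up under `./data`.

```python
# Minimal sketch: confirm the downloaded checkpoints are where the configs expect them.
# Only sam2_hiera_tiny.pt is a known default name; the rest is just a listing of ./data.
from pathlib import Path

sam2_ckpt = Path("./data/checkpoints/sam2_hiera_tiny.pt")
print(f"SAM2 checkpoint {'found' if sam2_ckpt.is_file() else 'MISSING'}: {sam2_ckpt}")

# List whatever the ControlVLA archive unpacked into ./data for a quick visual check.
for path in sorted(Path("./data").rglob("*")):
    if path.is_file():
        print(path)
```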
Note that our data is collected and pre-processed with the UMI system, which provides robot-arm-agnostic training data. You may also collect data with your own robot setup for faster validation and deployment.
- Collect data with the UMI data pipeline system and obtain the `replay_buffer.zarr.zip` data file. An example of this zarr data file is provided here.
- Annotate the interactive parts of each object for SAM2.

  ```bash
  python scripts_objcen_pipeline/prompts_annotation.py -i ./example_finetune_demo/picknplace_toy.d10
  python scripts_objcen_pipeline/prompts_extraction.py -i picknplace_toy.d10
  ```

  You can also use GroundingDINO to automatically annotate the interactive parts from task language instructions.
- Process and integrate the object-centric masks into the data file (a sketch for inspecting the result follows this list).

  ```bash
  python scripts_objcen_pipeline/08_propagate_interactive_parts.py -i ./example_finetune_demo/picknplace_toy.d10
  python scripts_objcen_pipeline/09_integrate_into_dataset.py -i ./example_finetune_demo/picknplace_toy.d10
  ```
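To sanity-check the processed data, you can list the arrays stored in the replay buffer. The sketch below assumes a zarr 2.x install and that `replay_buffer.zarr.zip` sits inside the demo folder; the array names added by the integration step are not hardcoded here, so look for the mask arrays in the printout.

```python
# Minimal sketch: list every array in a replay_buffer.zarr.zip to confirm that the
# object-centric masks were integrated. The file location is an assumption (zarr 2.x API);
# adjust it to your own dataset layout.
import zarr

store = zarr.ZipStore(
    "./example_finetune_demo/picknplace_toy.d10/replay_buffer.zarr.zip", mode="r"
)
root = zarr.open_group(store, mode="r")

def walk(group, prefix=""):
    """Recursively print each array's path, shape, and dtype."""
    for name, item in group.items():
        path = f"{prefix}/{name}" if prefix else name
        if hasattr(item, "shape"):   # zarr array
            print(f"{path}: shape={item.shape}, dtype={item.dtype}")
        else:                        # zarr group
            walk(item, path)

walk(root)
store.close()
```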
Finetune the pre-trained ControlVLA model on the example dataset with the following command:

```bash
bash runs/controlvla_pnptoy.sh
```

For real-world deployment, customize your robot and camera interface for the inference script `eval_controlvla.py`.
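As a reference point, here is a hypothetical sketch of the kind of robot/camera wrapper you might wire in; the class name, method signatures, and observation keys are illustrative assumptions, not the script's actual API.

```python
# Hypothetical sketch of a robot/camera interface for eval_controlvla.py.
# Everything here (names, keys, shapes) is an assumption -- adapt it to the hooks
# that eval_controlvla.py actually exposes and to your robot driver.
import numpy as np


class MyRobotEnv:
    """Wraps a robot arm and an RGB camera behind a simple observe/act interface."""

    def get_observation(self) -> dict:
        # Return the latest camera frame plus proprioceptive state.
        # Shapes and keys are placeholders; match the format your finetuned policy expects.
        return {
            "rgb": np.zeros((224, 224, 3), dtype=np.uint8),  # latest camera frame
            "eef_pose": np.zeros(6, dtype=np.float32),       # end-effector pose (xyz + rotation)
            "gripper_width": 0.0,                            # current gripper opening
        }

    def execute_action(self, action: np.ndarray) -> None:
        # Send the predicted end-effector target (and gripper command) to your robot driver.
        ...
```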
Then run:

```bash
python eval_controlvla.py -i ./data/checkpoints/latest.ckpt -p ./example_finetune_demo/picknplace_toy.d10/picknplace_toy.d10.objectcentric.anno.pkl
```

We thank Yuyang Li, Yuwei Guo, and Ziyuan Jiao for their valuable discussions and technical support. This work builds upon the codebase of UMI.
If you find this work useful, please consider citing:
```bibtex
@article{li2025controlvla,
  title={ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models},
  author={Li, Puhao and Wu, Yingying and Xi, Ziheng and Li, Wanlin and Huang, Yuzhe and Zhang, Zhiyuan and Chen, Yinghan and Wang, Jianan and Zhu, Song-Chun and Liu, Tengyu and others},
  journal={arXiv preprint arXiv:2506.16211},
  year={2025}
}
```

If you have any questions about this work, feel free to contact Puhao Li at puhaoli01@gmail.com.