We run inference using a policy server and a hardware client. The instructions for running the policy server can be found in examples/umi/README.md, and we provide the UMI hardware client code in this repository.
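For illustration, here is a minimal sketch of the client side of this setup, assuming the websocket client interface documented in openpi (`openpi_client.websocket_client_policy`); the observation keys, host, port, and loop structure are placeholders rather than the exact ones used by the UMI hardware client.

```python
# Minimal sketch of a hardware-client control loop, assuming openpi's websocket
# client interface; observation keys, port, and loop details are placeholders.
import numpy as np
from openpi_client import websocket_client_policy

def main() -> None:
    # Connect to the policy server started as described in examples/umi/README.md.
    client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)

    for _ in range(100):  # control loop (placeholder length)
        observation = {
            "observation/image": np.zeros((224, 224, 3), dtype=np.uint8),  # camera frame
            "observation/state": np.zeros(7, dtype=np.float32),            # proprioception
            "prompt": "make a cocktail",
        }
        result = client.infer(observation)  # blocking call to the policy server
        action_chunk = result["actions"]
        # ... send action_chunk to the robot controller here ...

if __name__ == "__main__":
    main()
```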
📷 Data
We provide access to the following datasets:
Robot Datasets: Datasets for the cocktail and open-world visual grounding tasks.
Vision-Language Datasets: Datasets containing synthetic images and annotated reasoning for the open-world visual grounding task.
All datasets are hosted on Hugging Face. You can find them here.
We provide code for converting the UMI data format to the LeRobot data format here.
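For orientation, the sketch below shows one way to iterate over episodes in a UMI-style zarr replay buffer before rewriting them as LeRobot episodes. The `data`/`meta/episode_ends` layout and the per-key slicing are assumptions about the UMI format, not the logic of the converter linked above.

```python
# Rough sketch, not the actual converter: assumes a UMI-style zarr replay buffer
# with a `data` group of per-step arrays and `meta/episode_ends` boundaries.
import zarr

def iter_umi_episodes(zarr_path: str):
    """Yield one dict of per-step arrays for each episode in the buffer."""
    root = zarr.open(zarr_path, mode="r")
    data, meta = root["data"], root["meta"]
    episode_ends = meta["episode_ends"][:]  # cumulative end index of each episode
    start = 0
    for end in episode_ends:
        yield {key: data[key][start:end] for key in data.keys()}
        start = int(end)

# Each yielded episode can then be written out frame by frame using the
# LeRobot dataset tooling referenced above.
```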
Synthetic Image Augmentation
To make the synthetic images more closely resemble real robot observations, we randomly apply several augmentations, including random fisheye distortion and compositing a robot gripper into the image with adaptive brightness adjustment. The implementation is available in scripts/augment_vl_data/augment.py.
Here we show an example. From left to right: the original image, the image with fisheye distortion, the image with a composited robot gripper (with adaptive brightness adjustment), and the image with both augmentations applied.
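For intuition, here is a minimal sketch of the two augmentations under simple assumptions: a radial remap standing in for fisheye distortion, and mean-brightness matching before alpha-compositing a gripper overlay. Parameter ranges and helper names are illustrative; the actual implementation is in scripts/augment_vl_data/augment.py.

```python
# Illustrative sketch only; see scripts/augment_vl_data/augment.py for the real
# implementation. Parameter choices and helper names here are assumptions.
import cv2
import numpy as np

def random_fisheye(img: np.ndarray, strength: float = 0.3) -> np.ndarray:
    """Approximate a fisheye look with a radial (barrel) remap."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    xn, yn = (xs - w / 2) / (w / 2), (ys - h / 2) / (h / 2)  # normalized coordinates
    factor = 1.0 + strength * (xn ** 2 + yn ** 2)  # pull source samples outward with radius
    map_x = xn * factor * (w / 2) + w / 2
    map_y = yn * factor * (h / 2) + h / 2
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT)

def composite_gripper(img: np.ndarray, gripper_rgba: np.ndarray) -> np.ndarray:
    """Alpha-composite a gripper overlay (same size as img) with brightness matching."""
    rgb = gripper_rgba[..., :3].astype(np.float32)
    alpha = gripper_rgba[..., 3:4].astype(np.float32) / 255.0
    # Adaptive brightness: scale the overlay so its mean matches the background's.
    scale = img.mean() / max(rgb[alpha[..., 0] > 0].mean(), 1e-6)
    rgb = np.clip(rgb * scale, 0, 255)
    out = img.astype(np.float32) * (1 - alpha) + rgb * alpha
    return out.astype(np.uint8)
```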
🙏 Acknowledgements
We express our sincere gratitude to the developers of openpi for open-sourcing their code.
Official implementation of "OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning"