OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
Introducing OneTwoVLA, a single unified model that can both reason and act and adaptively switch between the two modes (see the sketch below the author list). OneTwoVLA demonstrates superior performance in the following capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and open-world visual grounding.
Fanqi Lin1,2,3,5*
Ruiqian Nai1,2,3,5*
Yingdong Hu1,2,3*
Jiacheng You1,2,3
Junming Zhao1,4
Yang Gao1,2,3,5†
1Tsinghua University, 2Shanghai Qi Zhi Institute, 3Shanghai AI Lab, 4Fudan University, 5Spirit AI
*Equal Contribution †Corresponding author
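To make the adaptive switching concrete, here is a minimal Python sketch of one possible control loop in which a single model decides, at every step, whether to reason or to act. The control tokens [THINK] and [ACT], the UnifiedVLA interface, and all method names are illustrative assumptions made for this sketch, not the released OneTwoVLA API.

# Minimal sketch of adaptive reasoning/acting mode switching.
# The [THINK]/[ACT] tokens and the UnifiedVLA protocol below are
# illustrative assumptions, not the released OneTwoVLA interface.
from typing import Optional, Protocol, Sequence, Tuple

class UnifiedVLA(Protocol):
    """Assumed interface of a single model that can both reason and act."""
    def predict_mode(self, obs, instruction: str, reasoning: str) -> str: ...
    def generate_reasoning(self, obs, instruction: str, reasoning: str) -> str: ...
    def generate_actions(self, obs, instruction: str, reasoning: str) -> Sequence[Sequence[float]]: ...

def control_step(
    model: UnifiedVLA, obs, instruction: str, reasoning: str
) -> Tuple[str, Optional[Sequence[Sequence[float]]]]:
    """One control step: the model first chooses its mode, then reasons or acts."""
    mode = model.predict_mode(obs, instruction, reasoning)
    if mode == "[THINK]":
        # Reasoning mode: update the plan, track progress, flag an error,
        # or ask the user for clarification; no new action chunk this step.
        return model.generate_reasoning(obs, instruction, reasoning), None
    # Acting mode: emit an action chunk conditioned on the latest reasoning.
    return reasoning, model.generate_actions(obs, instruction, reasoning)

Carrying the reasoning text forward keeps later action chunks conditioned on the most recent plan, which is what lets one model handle re-planning, error recovery, and clarification questions within a single loop.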
Long-Horizon Task Planning
OneTwoVLA excels at handling long-horizon manipulation tasks. It consistently demonstrates the ability to understand the physical scene, generate correct plans, track task progress accurately, and produce precise actions. This allows OneTwoVLA to successfully complete challenging tasks such as hotpot cooking, tomato-egg scramble, and cocktail mixing.
Hotpot Cooking
Tomato-Egg Scramble
Cocktail Mixing
Moreover, co-training with our synthetic embodied-reasoning-centric vision-language data enables OneTwoVLA to demonstrate generalizable planning capabilities on unseen tasks.
Generalizable Planning Tasks
Error Detection and Recovery
Recovering from mistakes is a critical capability for general-purpose robots. OneTwoVLA can detect errors in real time, rapidly reason about recovery strategies, and then generate corrective actions.
Natural Human-Robot Interaction
To deploy robots in human-centric scenarios, the ability to interact naturally with humans is indispensable. Owing to its adaptive nature and explicit reasoning process, OneTwoVLA engages with humans naturally: it seamlessly handles human interventions and proactively seeks clarification when faced with ambiguity.
Open-World Visual Grounding
Co-training OneTwoVLA with our synthetic embodied-reasoning-centric vision-language data endows it with open-world visual grounding capabilities, enabling it to effectively comprehend spatial relationships, object attributes, and semantic features, even for objects unseen during training (e.g., GoPro, Sprite, Starbucks Coffee). The following videos demonstrate our robot successfully reaching target objects based on language instructions.
Synthetic Vision-Language Data Examples
To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable, automatic pipeline for synthesizing embodied-reasoning-centric vision-language data without any human intervention; this data is used for co-training with robot data. The task instructions for each synthetic image fall into two categories: visual grounding tasks and long-horizon planning tasks. We show some examples here, followed by a sketch of such a pipeline:
Visual Grounding
Reasoning: I need to pick up the brown knitted scarf which provides warmth.
Reasoning: I need to pick up the snow globe ornament containing a Christmas tree behind the book.
Reasoning: I need to pick up the book on the right side of the table.
Long-Horizon Planning
Reasoning Plan: 1. Pour the cherry tomatoes into the large wooden bowl. 2. Pour the arugula into the large wooden bowl. 3. Add some sliced cucumbers to the large wooden bowl. 4. Take the croutons and sprinkle them evenly on top of the salad. 5. Pour olive oil over the salad. 6. Gently toss the ingredients together.
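As a rough illustration of what such a pipeline could look like, the sketch below prompts an off-the-shelf vision-language model to write grounding-style and planning-style annotations for a pool of images and stores the results as JSONL for co-training. The prompt wording, the annotate_image callable, and the record layout are assumptions made for this sketch, not the paper's actual implementation.

# Minimal sketch of an automatic synthesis pipeline for
# embodied-reasoning-centric vision-language data. The prompts, the
# annotate_image callable, and the JSON record layout are assumptions.
import json

GROUNDING_PROMPT = (
    "Pick one object in this image identified by a spatial relation, an "
    "attribute, or its semantic function; write a pick-up instruction and a "
    "short reasoning sentence for it."
)
PLANNING_PROMPT = (
    "Propose a long-horizon manipulation task for this scene and write a "
    "numbered step-by-step plan."
)

def synthesize(image_paths, annotate_image, out_path="synthetic_vl_data.jsonl"):
    """annotate_image(image_path, prompt) -> str wraps any off-the-shelf VLM."""
    with open(out_path, "w") as f:
        for path in image_paths:
            for task_type, prompt in (("grounding", GROUNDING_PROMPT),
                                      ("planning", PLANNING_PROMPT)):
                record = {
                    "image": str(path),
                    "task_type": task_type,
                    "reasoning": annotate_image(path, prompt),  # VLM-written text
                }
                f.write(json.dumps(record) + "\n")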
Hardware
BibTeX
@misc{lin2025onetwovlaunifiedvisionlanguageactionmodel,
      title={OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning},
      author={Fanqi Lin and Ruiqian Nai and Yingdong Hu and Jiacheng You and Junming Zhao and Yang Gao},
      year={2025},
      eprint={2505.11917},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.11917},
}