VLA-0
Building State-of-the-Art VLAs with Zero Modification
Ankit Goyal Hugo Hadfield Xuning Yang Valts Blukis Fabio Ramos
NVIDIA
VLA-0 converts a VLM into a VLA by prompting it to predict actions as text
Summary
Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads.
Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored.
This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective but surprisingly powerful: with the right design choices, it outperforms more involved models.
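The core recipe is simply prompting and parsing. Below is a minimal sketch of the idea, assuming normalized action bounds; `query_vlm` is a hypothetical stand-in for any instruction-tuned VLM, not the paper's actual interface or prompt.

```python
import re

# Hypothetical stand-in for generating text from an instruction-tuned VLM
# with the current camera image attached; returns a canned reply here.
def query_vlm(image, prompt):
    return "512 488 506 500 495 510 1000"  # e.g. a predicted 7-DoF action

def decode_action(text, low, high, resolution=1024):
    """Map integers in [0, resolution] back to continuous action values."""
    ints = [int(t) for t in re.findall(r"-?\d+", text)]
    return [lo + (hi - lo) * i / resolution for i, lo, hi in zip(ints, low, high)]

prompt = ("Task: put the apple in the bowl. Respond with the next robot "
          "action as 7 integers between 0 and 1024, separated by spaces.")
print(decode_action(query_vlm(None, prompt), low=[-1.0] * 7, high=[1.0] * 7))
```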
Key Findings
- ✓ Outperforms all methods trained on the same robotic data on the LIBERO benchmark
- ✓ Outperforms methods with large-scale pretraining (π₀, π₀.₅-KI, GR00T-N1, MolmoAct)
- ✓ Outperforms SmolVLA in real-world tasks despite no large-scale pretraining
- ✓ Requires no architectural changes to the base VLM
How VLA-0 Compares
We categorize existing VLAs into three families. VLA-0 takes the simplest approach.
Discrete Token VLAs
Examples: RT-2, OpenVLA
- Discretize actions into bins (sketched below)
- Assign tokens from VLM vocabulary
- ⚠️ Limited action resolution
- ⚠️ Compromises language understanding
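For contrast, here is a minimal sketch of this binning recipe; the bin count and token-ID offset are illustrative placeholders, not values taken from any specific model.

```python
import numpy as np

N_BINS = 256          # a typical bin count in this family of models
TOKEN_OFFSET = 32000  # hypothetical IDs of vocabulary tokens repurposed for actions

def action_to_token_ids(action, low=-1.0, high=1.0):
    """Clip and uniformly bin each action dimension, then map bins to token IDs."""
    a = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.round((a - low) / (high - low) * (N_BINS - 1)).astype(int)
    return (TOKEN_OFFSET + bins).tolist()

def token_ids_to_action(token_ids, low=-1.0, high=1.0):
    bins = np.asarray(token_ids) - TOKEN_OFFSET
    return low + bins / (N_BINS - 1) * (high - low)

ids = action_to_token_ids([0.31, -0.82, 0.05])
print(ids, token_ids_to_action(ids))  # note the coarse, fixed 256-level resolution
```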
Generative Action Head VLAs
Examples: π₀, SmolVLA
- Attach action generation head
- Use diffusion or flow matching (sketched below)
- ⚠️ Requires a new neural network
- ⚠️ May degrade VLM capabilities
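To make the second family concrete, here is a minimal, illustrative flow-matching action head in PyTorch. The dimensions, architecture, and single training step are placeholders for exposition, not the design of any model named above.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Predicts the velocity carrying a noise sample toward the expert
    action, conditioned on pooled VLM features."""
    def __init__(self, feat_dim=768, action_dim=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + action_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, feats, noisy_action, t):
        return self.net(torch.cat([feats, noisy_action, t], dim=-1))

# One flow-matching training step on dummy data (all shapes are placeholders).
head = FlowMatchingActionHead()
feats = torch.randn(8, 768)         # pooled VLM features for a batch of 8
target = torch.randn(8, 7)          # expert actions
noise = torch.randn_like(target)
t = torch.rand(8, 1)
x_t = (1 - t) * noise + t * target  # linear interpolation path
loss = ((head(feats, x_t, t) - (target - noise)) ** 2).mean()
loss.backward()
```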
Custom Architecture VLAs
Examples: OpenVLA-OFT, π₀-FAST
- Specialized modifications
- Custom tokenizers
- ⚠️ Significant changes
- ⚠️ Complex training pipelines
VLA-0 (Ours)
Zero-modification approach
- ✓ Actions as text (integers)
- ✓ No vocabulary changes
- ✓ No architectural changes
- ✓ Arbitrary action resolution (see the sketch below)
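A minimal sketch of the corresponding encoding side, assuming per-dimension bounds are known; the paper's exact prompt and output format may differ.

```python
def encode_action(action, low, high, resolution=1024):
    """Render a continuous action as plain-text integers for the training target.
    Resolution is a free knob: the integers pass through the VLM's ordinary
    text tokenizer, so no tokens are added or repurposed."""
    ids = [round((a - lo) / (hi - lo) * resolution)
           for a, lo, hi in zip(action, low, high)]
    return " ".join(str(min(max(i, 0), resolution)) for i in ids)

print(encode_action([0.31, -0.82, 0.05], low=[-1.0] * 3, high=[1.0] * 3))
# -> "671 92 538"; switching to a finer grid only changes `resolution`
```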
Results
Real-World Performance
VLA-0 outperforms SmolVLA on real robot tasks using the SO-100 platform
Real-world demonstrations: block reorientation, apple pushing, banana and cupcake pick-and-place
Performance on LIBERO Benchmark
VLA-0 achieves the best average success rate among models without large-scale pretraining (all LIBERO numbers are task success rates in %; lower Avg. Rank is better)
| Model | Large-scale Pretrain | Type | Spatial | Object | Goal | Long | Average | Avg. Rank |
|---|---|---|---|---|---|---|---|---|
| *Models without large-scale pretraining* | | | | | | | | |
| Diffusion Policy | ✗ | N/A | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 | 6.5 |
| π₀-FAST (PaliGemma) | ✗ | Custom | 87.0 | 63.0 | 89.0 | 48.0 | 71.8 | 6.0 |
| SmolVLA (0.24B) | ✗ | Gen Head | 87.0 | 93.0 | 88.0 | 63.0 | 82.8 | 5.3 |
| SmolVLA (2.25B) | ✗ | Gen Head | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 | 4.0 |
| OpenVLA-OFT | ✗ | Custom | 94.3 | 95.2 | 91.7 | 86.5 | 91.9 | 2.8 |
| π₀.₅-KI | ✗ | Gen Head | 96.6 | 97.2 | 94.6 | 85.8 | 93.3 | 2.3 |
| VLA-0 (Ours) | ✗ | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 1.0 |
| *Models with large-scale pretraining (for reference)* | | | | | | | | |
| Octo | ✓ | Gen Head | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | 8.8 |
| OpenVLA | ✓ | Dis. Tok. | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 8.0 |
| π₀-FAST | ✓ | Custom | 90.0 | 86.0 | 95.0 | 73.0 | 86.0 | 6.5 |
| MolmoAct | ✓ | Dis. Tok. | 87.0 | 95.4 | 87.6 | 77.2 | 86.8 | 6.5 |
| GR00T-N1 | ✓ | Gen Head | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 4.5 |
| π₀ | ✓ | Gen Head | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | 3.3 |
| π₀.₅-KI | ✓ | Gen Head | 98.0 | 97.8 | 95.6 | 85.8 | 94.3 | 3.0 |
| OpenVLA-OFT | ✓ | Custom | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 | 1.5 |
| VLA-0 (Ours) | ✗ | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 2.8 |
Citation
If you find VLA-0 useful in your research, please consider citing:
```
@article{goyal2025vla0,
  title={VLA-0: Building State-of-the-Art VLAs with Zero Modification},
  author={Goyal, Ankit and Hadfield, Hugo and Yang, Xuning and Blukis, Valts and Ramos, Fabio},
  journal={arXiv preprint arXiv:2510.13054},
  year={2025}
}
```