VLA-0
Building State-of-the-Art VLAs with Zero Modification
Ankit Goyal Hugo Hadfield Xuning Yang Valts Blukis Fabio Ramos
NVIDIA
VLA-0 converts a VLM into a VLA by prompting it to predict actions as text
Summary
Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads.
Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored.
This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective but surprisingly powerful: with the right design choices, it outperforms more involved models.
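The core recipe is simply prompting and parsing. Below is a minimal sketch of the idea, assuming normalized action bounds; `query_vlm` is a hypothetical stand-in for any instruction-tuned VLM, not the paper's actual interface or prompt.

```python
import re

# Hypothetical stand-in for generating text from an instruction-tuned VLM
# with the current camera image attached; returns a canned reply here.
def query_vlm(image, prompt):
    return "512 488 506 500 495 510 1000"  # e.g. a predicted 7-DoF action

def decode_action(text, low, high, resolution=1024):
    """Map integers in [0, resolution] back to continuous action values."""
    ints = [int(t) for t in re.findall(r"-?\d+", text)]
    return [lo + (hi - lo) * i / resolution for i, lo, hi in zip(ints, low, high)]

prompt = ("Task: put the apple in the bowl. Respond with the next robot "
          "action as 7 integers between 0 and 1024, separated by spaces.")
print(decode_action(query_vlm(None, prompt), low=[-1.0] * 7, high=[1.0] * 7))
```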
Key Findings
- ✓ Outperforms all methods trained on the same robotic data on the LIBERO benchmark
- ✓ Outperforms methods with large-scale pretraining (π₀, π₀.₅-KI, GR00T-N1, MolmoAct)
- ✓ Outperforms SmolVLA in real-world tasks despite no large-scale pretraining
- ✓ Requires no architectural changes to the base VLM
How VLA-0 Compares
We categorize existing VLAs into three families. VLA-0 takes the simplest approach.
Discrete Token VLAs
Examples: RT-2, OpenVLA
- Discretize actions into bins (sketched below)
- Assign tokens from VLM vocabulary
- ⚠️ Limited action resolution
- ⚠️ Compromises language understanding
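For contrast, here is a minimal sketch of this binning recipe; the bin count and token-ID offset are illustrative placeholders, not values taken from any specific model.

```python
import numpy as np

N_BINS = 256          # a typical bin count in this family of models
TOKEN_OFFSET = 32000  # hypothetical IDs of vocabulary tokens repurposed for actions

def action_to_token_ids(action, low=-1.0, high=1.0):
    """Clip and uniformly bin each action dimension, then map bins to token IDs."""
    a = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.round((a - low) / (high - low) * (N_BINS - 1)).astype(int)
    return (TOKEN_OFFSET + bins).tolist()

def token_ids_to_action(token_ids, low=-1.0, high=1.0):
    bins = np.asarray(token_ids) - TOKEN_OFFSET
    return low + bins / (N_BINS - 1) * (high - low)

ids = action_to_token_ids([0.31, -0.82, 0.05])
print(ids, token_ids_to_action(ids))  # note the coarse, fixed 256-level resolution
```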
Generative Action Head VLAs
Examples: π₀, SmolVLA
- Attach action generation head
- Use diffusion or flow matching (sketched below)
- ⚠️ Requires a new neural network
- ⚠️ May degrade VLM capabilities
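To make the second family concrete, here is a minimal, illustrative flow-matching action head in PyTorch. The dimensions, architecture, and single training step are placeholders for exposition, not the design of any model named above.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Predicts the velocity carrying a noise sample toward the expert
    action, conditioned on pooled VLM features."""
    def __init__(self, feat_dim=768, action_dim=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + action_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, feats, noisy_action, t):
        return self.net(torch.cat([feats, noisy_action, t], dim=-1))

# One flow-matching training step on dummy data (all shapes are placeholders).
head = FlowMatchingActionHead()
feats = torch.randn(8, 768)         # pooled VLM features for a batch of 8
target = torch.randn(8, 7)          # expert actions
noise = torch.randn_like(target)
t = torch.rand(8, 1)
x_t = (1 - t) * noise + t * target  # linear interpolation path
loss = ((head(feats, x_t, t) - (target - noise)) ** 2).mean()
loss.backward()
```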
Custom Architecture VLAs
Examples: OpenVLA-OFT, π₀-FAST
- Specialized modifications
- Custom tokenizers
- ⚠️ Significant changes
- ⚠️ Complex training pipelines
VLA-0 (Ours)
Zero-modification approach
- ✓ Actions as text (integers)
- ✓ No vocabulary changes
- ✓ No architectural changes
- ✓ Arbitrary action resolution (see the sketch below)
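A minimal sketch of the corresponding encoding side, assuming per-dimension bounds are known; the paper's exact prompt and output format may differ.

```python
def encode_action(action, low, high, resolution=1024):
    """Render a continuous action as plain-text integers for the training target.
    Resolution is a free knob: the integers pass through the VLM's ordinary
    text tokenizer, so no tokens are added or repurposed."""
    ids = [round((a - lo) / (hi - lo) * resolution)
           for a, lo, hi in zip(action, low, high)]
    return " ".join(str(min(max(i, 0), resolution)) for i in ids)

print(encode_action([0.31, -0.82, 0.05], low=[-1.0] * 3, high=[1.0] * 3))
# -> "671 92 538"; switching to a finer grid only changes `resolution`
```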
Results
Real-World Performance
VLA-0 outperforms SmolVLA on real robot tasks using the SO-100 platform
Real-world demonstrations: block reorientation, apple pushing, banana and cupcake pick-and-place
Performance on LIBERO Benchmark
VLA-0 achieves the best average success rate among models without large-scale pretraining (all LIBERO numbers are task success rates in %; lower Avg. Rank is better)
| Model | Large-scale Pretrain | Type | Spatial | Object | Goal | Long | Average | Avg. Rank |
|---|---|---|---|---|---|---|---|---|
| *Models without large-scale pretraining* | | | | | | | | |
| Diffusion Policy | ✗ | N/A | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 | 6.5 |
| π₀-FAST (PaliGemma) | ✗ | Custom | 87.0 | 63.0 | 89.0 | 48.0 | 71.8 | 6.0 |
| SmolVLA (0.24B) | ✗ | Gen Head | 87.0 | 93.0 | 88.0 | 63.0 | 82.8 | 5.3 |
| SmolVLA (2.25B) | ✗ | Gen Head | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 | 4.0 |
| OpenVLA-OFT | ✗ | Custom | 94.3 | 95.2 | 91.7 | 86.5 | 91.9 | 2.8 |
| π₀.₅-KI | ✗ | Gen Head | 96.6 | 97.2 | 94.6 | 85.8 | 93.3 | 2.3 |
| VLA-0 (Ours) | ✗ | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 1.0 |
| *Models with large-scale pretraining (for reference)* | | | | | | | | |
| Octo | ✓ | Gen Head | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | 8.8 |
| OpenVLA | ✓ | Dis. Tok. | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 8.0 |
| π₀-FAST | ✓ | Custom | 90.0 | 86.0 | 95.0 | 73.0 | 86.0 | 6.5 |
| MolmoAct | ✓ | Dis. Tok. | 87.0 | 95.4 | 87.6 | 77.2 | 86.8 | 6.5 |
| GR00T-N1 | ✓ | Gen Head | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 4.5 |
| π₀ | ✓ | Gen Head | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | 3.3 |
| π₀.₅-KI | ✓ | Gen Head | 98.0 | 97.8 | 95.6 | 85.8 | 94.3 | 3.0 |
| OpenVLA-OFT | ✓ | Custom | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 | 1.5 |
| VLA-0 (Ours) | ✗ | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 2.8 |
Citation
If you find VLA-0 useful in your research, please consider citing:
```
@article{goyal2025vla0,
  title={VLA-0: Building State-of-the-Art VLAs with Zero Modification},
  author={Goyal, Ankit and Hadfield, Hugo and Yang, Xuning and Blukis, Valts and Ramos, Fabio},
  journal={arXiv preprint arXiv:2510.13054},
  year={2025}
}
```