Video models are zero-shot
learners and reasoners
TL;DR
Veo 3 shows emergent zero-shot abilities across many visual tasks, indicating that video models are on a path to becoming vision foundation models—just like LLMs became foundation models for language.
Abstract
The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled
natural language processing from task-specific models to unified, generalist foundation
models. This transformation emerged from simple primitives: large, generative models
trained on web-scale data. Curiously, the same primitives apply to today's generative
video models. Could video models be on a trajectory towards general-purpose vision
understanding, much like LLMs developed general-purpose language understanding?
We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't
explicitly trained for: segmenting objects, detecting edges, editing images, understanding
physical properties, recognizing object affordances, simulating tool use, and much more.
These abilities to perceive, model, and manipulate the visual world enable early forms of
visual reasoning like maze and symmetry solving. Veo 3's emergent zero-shot capabilities
indicate that video models are on a path to becoming unified, generalist vision foundation
models.
Podcast
On a run and want the gist of our paper? Listen to the podcast below!
Perception
Edge detection
Segmentation
Keypoint localization
Super-resolution
Blind deblurring
Blind denoising
Low-light enhancement
Conjunctive search / binding problem
Dalmatian illusion understanding
Shape cue-conflict understanding
Rorschach blot interpretation
Modeling
Material properties (flammability)
Rigid body transform
Soft body transform
Gravity (earth)
Gravity (moon)
Buoyancy (bottle cap)
Buoyancy (rock)
Visual Jenga
Object packing
Material optics (glass)
Material optics (mirror)
Color mixing (additive)
Color mixing (subtractive)
Categorizing objects
Omniglot (recognition)
Omniglot (generation)
Omniglot (parsing)
Memory of world states
Manipulation
Background removal
Style transfer
Colorization
Inpainting
Outpainting
Text manipulation
Image editing with doodles
Scene composition
Novel view synthesis
3D-aware reposing
Transfiguration
Professional headshot
Dexterous manipulation (jar)
Dexterous manipulation (throw/catch)
Dexterous manipulation (baoding balls)
Affordance recognition
Drawing
Visual instruction (burrito rolling)
Reasoning
Graph traversal
Tree BFS
Sequence (dots)
Sequence (arrows)
Sequence (circles)
Sequence (squares)
Connecting colors
Shape fitting
Sorting numbers
Tool use
Simple sudoku completion
Water puzzle solving
Maze solving (mouse)
Robot navigation
Rule extrapolation
Analogy (color)
Analogy (resize)
Analogy (reflect)
Analogy (rotate)
Maze (5x5)
Maze (7x7)
Maze (9x9)
Maze (irregular)
Symmetry (shape)
Symmetry (random)
Failure cases
Monocular depth estimation
Monocular surface normal estimation
Force prompting
Motion trajectory prompting
Tying the knot
Connect the path puzzle
Letter word search
Eulerian path
Solving linear equations
Spot the difference
Visual IQ test
Glass falling
Collisions
Jigsaw puzzle
Sliding puzzle
Scrambled puzzle
Bottleneck
Laundry folding
Motion planning
BibTeX
@article{wiedemer2025video,
title={Video models are zero-shot learners and reasoners},
author={Wiedemer, Thadd{\"a}us and Li, Yuxuan and Vicol, Paul and Gu, Shixiang Shane and Matarese, Nick and Swersky, Kevin and Kim, Been and Jaini, Priyank and Geirhos, Robert},
journal={arXiv preprint arXiv:2509.20328},
year={2025}
}
Paper