Results
Following Atomic Actions
We show samples of PEVA following atomic actions.
Move Forward
Rotate Left
Rotate Right
Move Left Hand Up
Move Left Hand Down
Move Left Hand Left
Move Left Hand Right
Move Right Hand Up
Move Right Hand Down
Move Right Hand Left
Move Right Hand Right
Long Video Generation
Generation over long horizons, including 16-second examples. PEVA generates coherent 16-second rollouts conditioned on whole-body motion.
More Long Video Generation
Planning with Multiple Action Candidates
We explore PEVA's ability to serve as a world model with a planning example: we simulate multiple action candidates with PEVA and score them by their perceptual similarity to the goal, measured with LPIPS. Hover over the video to play.
PEVA enables us to rule out the action sequences that lead to the sink in the top row and outdoors in the second row, and to find a reasonable sequence of actions to open the refrigerator in the third row.
PEVA enables us to rule out the action sequences that lead to the plants in the top row and to the kitchen in the second row, and to find a reasonable sequence of actions to grab the box in the third row.
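As an illustration of this scoring step, the sketch below ranks candidate action sequences by the LPIPS distance between each predicted final frame and the goal frame. The `peva.rollout` call, tensor shapes, and variable names are assumptions for illustration, not the released interface.

```python
# Minimal sketch: score candidate action sequences by LPIPS distance to the goal.
# `peva.rollout(start_frame, actions)` is a hypothetical interface, not the released API.
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="alex")  # expects images scaled to [-1, 1]

def score_candidates(peva, start_frame, candidate_action_seqs, goal_frame):
    """Return the LPIPS distance between each candidate's final predicted frame and the goal.

    start_frame, goal_frame: (3, H, W) tensors scaled to [-1, 1].
    candidate_action_seqs: list of (T, action_dim) tensors.
    """
    scores = []
    with torch.no_grad():
        for actions in candidate_action_seqs:
            rollout = peva.rollout(start_frame, actions)    # (T, 3, H, W), hypothetical
            final = rollout[-1].unsqueeze(0)                 # (1, 3, H, W)
            dist = lpips_fn(final, goal_frame.unsqueeze(0))  # perceptual distance to the goal
            scores.append(dist.item())
    return scores  # lower = perceptually closer to the goal
```

The lowest-scoring candidate is the one whose predicted outcome looks most like the goal; the high-scoring candidates are the ones we rule out.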
More Attempts with Planning
We formulate planning as an energy minimization problem and perform standalone planning in the same way as NWM (Bar et al., 2025), using the Cross-Entropy Method (CEM) (Rubinstein, 1997) with minor modifications to the representation and initialization of the action. For simplicity, we conduct two experiments in which we plan motion for only the left or right arm, predicting relative joint rotations represented as Euler angles.
In this case, we are able to find a sequence of actions that raises the right arm to the mixing stick. This also exposes a limitation of our setup: since we plan only the right arm, we do not predict the corresponding lowering of the left arm.
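For reference, a minimal sketch of this kind of CEM loop is shown below. It assumes a `rollout_energy` function (e.g. the LPIPS distance between the rollout's final frame and the goal) and illustrative hyperparameters; it is not the exact procedure used in NWM or in our experiments.

```python
# Minimal sketch of CEM planning over one arm's relative joint rotations (Euler angles).
# `rollout_energy(actions)` is assumed to roll out PEVA and return e.g. the LPIPS
# distance between the final predicted frame and the goal frame.
import torch

def cem_plan(rollout_energy, horizon, action_dim,
             n_samples=64, n_elites=8, n_iters=10, init_std=0.1):
    """Cross-Entropy Method: refit a Gaussian over action sequences to the elites."""
    mean = torch.zeros(horizon, action_dim)            # relative Euler angles per step
    std = torch.full((horizon, action_dim), init_std)
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = mean + std * torch.randn(n_samples, horizon, action_dim)
        # Evaluate the energy of each candidate (lower is better).
        energies = torch.tensor([rollout_energy(a) for a in samples])
        # Keep the lowest-energy samples and refit the sampling distribution.
        elites = samples[energies.argsort()[:n_elites]]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
    return mean  # planned sequence of relative joint rotations
```

Each iteration refits the sampling Gaussian to the lowest-energy action sequences, so the planner concentrates its samples around promising arm motions.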
What can it unlock?
• Action–Perception Coupling
Humans act to see, and see to act.
• Moving Beyond Synthetic Actions
Prior world models use abstract control signals; ours models real, physical human action.
• Toward Embodied Intelligence
Physically grounded video models bring us closer to agents that plan, adapt, and interact like humans.
• Intention Understanding Through Prediction
Predicting what an agent will see is a path to inferring what it wants.
Method
Random Timeskips: Training with random timeskips allows the model to learn both short-term motion dynamics and longer-term activity patterns.
Sequence-Level Training: We model the entire motion sequence by applying the loss over each prefix of frames.
Action Embeddings: Whole-body motion is high-dimensional, so we concatenate all actions at time t into a single 1D tensor and use it to condition each AdaLN layer.
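To make the last two design choices concrete, the sketches below show one possible implementation; the layer sizes, tensor shapes, and names (e.g. `AdaLNBlock`, `embed_action`) are illustrative assumptions rather than the released architecture.

```python
# Minimal sketch of whole-body action conditioning via AdaLN.
import torch
import torch.nn as nn

def embed_action(action_parts):
    # Concatenate all per-joint actions at time t (root translation, joint
    # rotations, ...) into a single 1D vector per sample.
    # action_parts: list of (B, d_i) tensors.
    return torch.cat(action_parts, dim=-1)

class AdaLNBlock(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Map the 1D action embedding to per-channel scale and shift.
        self.to_scale_shift = nn.Linear(action_dim, 2 * hidden_dim)

    def forward(self, x, action_vec):
        # x: (B, N, hidden_dim) tokens; action_vec: (B, action_dim)
        scale, shift = self.to_scale_shift(action_vec).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

A sequence-level loss over frame prefixes can be sketched similarly, assuming a model that predicts the next frame from all preceding frames and actions:

```python
# Minimal sketch of a sequence-level loss applied over each prefix of frames.
def sequence_loss(model, frames, actions, loss_fn):
    # frames: (B, T, 3, H, W); actions: (B, T, action_dim)
    total = 0.0
    for t in range(1, frames.shape[1]):
        pred = model(frames[:, :t], actions[:, :t])  # predict frame t from the prefix
        total = total + loss_fn(pred, frames[:, t])
    return total / (frames.shape[1] - 1)
```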
Quantitative Results
Atomic Action Performance
Comparison of models in generating videos of atomic actions.
Baselines
Baseline Perceptual Metrics.
Video Quality
Video Quality Across Time (FID).
Scaling