
Why is it hard?

• Action and Vision Are Heavily Context-Dependent

The same view can lead to different movements, and vice versa, because humans act in complex, embodied, goal-directed environments.

• Human Control is High-Dimensional and Structured

Full-body motion spans 48+ DoF with hierarchical, time-dependent dynamics—not synthetic control codes.

• Egocentric View Reveals Intention—But Hides the Body

First-person vision reflects goals, but not motion execution—models must infer consequences from invisible physical actions.

• Perception Lags Behind Action

Visual feedback often comes seconds later, requiring long-horizon prediction and temporal reasoning.

Results

Following Atomic Actions

We demonstrate samples of PEVA following atomic actions.

• Move Forward
• Rotate Left
• Rotate Right
• Move Left Hand Up
• Move Left Hand Down
• Move Left Hand Left
• Move Left Hand Right
• Move Right Hand Up
• Move Right Hand Down
• Move Right Hand Left
• Move Right Hand Right

Long Video Generation

Generation over long horizons: PEVA generates coherent 16-second rollouts conditioned on whole-body motion.
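A rollout of this length can be produced autoregressively: each predicted frame is fed back as context for the next step, together with the whole-body action for that timestep. Below is a minimal Python sketch of such a loop; predict_next_frame, the frame rate, and the context handling are hypothetical stand-ins, not our exact implementation.

import numpy as np

def rollout(predict_next_frame, context_frames, pose_sequence, fps=4, seconds=16):
    # Autoregressively roll out `seconds` of video at `fps` frames per second.
    # predict_next_frame(frames, action) is a hypothetical interface: it returns
    # one predicted egocentric frame given past frames and the whole-body action.
    frames = list(context_frames)              # observed context frames
    generated = []
    for t in range(fps * seconds):             # e.g. 64 steps for 16 s at 4 fps
        action = pose_sequence[t]              # whole-body action at step t
        next_frame = predict_next_frame(frames, action)
        frames.append(next_frame)              # feed the prediction back as context
        generated.append(next_frame)
    return np.stack(generated)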

More Long Video Generation


Planning with Multiple Action Candidates

We explore PEVA's ability to serve as a world model with a planning example: we simulate multiple action candidates with PEVA and score them by their perceptual similarity to the goal, measured with LPIPS.

PEVA lets us rule out the action sequences that lead to the sink in the top row and outdoors in the second row, and find a reasonable sequence of actions to open the refrigerator in the third row.

PEVA lets us rule out the action sequences that lead to the plants in the top row and to the kitchen in the second row, and find a reasonable sequence of actions to grab the box in the third row.
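Concretely, the selection step can be as simple as rolling out each candidate action sequence and ranking the rollouts by the LPIPS distance between their final frame and a goal image. The sketch below is an illustration rather than our exact code: simulate_final_frame is a hypothetical hook around a PEVA rollout, and the scoring uses the public lpips package.

import torch
import lpips  # pip install lpips

def rank_candidates(simulate_final_frame, candidates, context, goal_image):
    # Rank candidate action sequences by perceptual similarity of their simulated
    # outcome to the goal image (lower LPIPS distance = better candidate).
    # simulate_final_frame(context, actions) is a hypothetical hook that returns
    # the final predicted frame as a 3xHxW tensor scaled to [-1, 1].
    loss_fn = lpips.LPIPS(net='alex')
    scores = []
    for actions in candidates:
        frame = simulate_final_frame(context, actions)
        scores.append(loss_fn(frame.unsqueeze(0), goal_image.unsqueeze(0)).item())
    order = sorted(range(len(candidates)), key=lambda i: scores[i])
    return order, scores  # order[0] is the best candidate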

More Attempts with Planning

We formulate planning as an energy minimization problem and perform standalone planning in the same way as NWM (Bar et al., 2025), using the Cross-Entropy Method (CEM) (Rubinstein, 1997) aside from minor modifications to the representation and initialization of the action. For simplicity, we conduct two experiments in which we only predict moving either the left or the right arm, controlled by predicting the relative joint rotations represented as Euler angles.
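As a rough illustration of this loop (a generic CEM sketch, not our exact representation or initialization), the planner keeps a Gaussian over the flattened action parameters, samples candidates, scores them with the world model, and refits the Gaussian to the best few:

import numpy as np

def cem_plan(energy, dim, iters=10, pop=64, elite_frac=0.1, init_std=0.5):
    # Cross-Entropy Method over a flat action parameter vector, e.g. a sequence
    # of relative joint rotations for one arm expressed as Euler angles.
    # energy(x) scores one candidate (lower is better), e.g. the LPIPS distance
    # between the PEVA rollout under x and the goal image.
    mean = np.zeros(dim)
    std = np.full(dim, init_std)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, dim)          # sample candidates
        scores = np.array([energy(x) for x in samples])           # evaluate with the world model
        elite = samples[np.argsort(scores)[:n_elite]]             # keep the best few
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit the Gaussian
    return mean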


In this case, we are able to predict a sequence of actions that raises the right arm to the mixing stick. We also see a limitation of our method: because we only predict the right arm, we do not predict the corresponding downward motion of the left arm.


What can it unlock?

• Action–Perception Coupling

Humans act to see, and see to act.

• Moving Beyond Synthetic Actions

Prior world models use abstract control signals; ours models real, physical human action.

• Toward Embodied Intelligence

Physically grounded video models bring us closer to agents that plan, adapt, and interact like humans.

• Intention Understanding Through Prediction

Predicting what an agent will see is a path to inferring what it wants.

Method


Random Timeskips: Sampling frames with random timeskips lets the model learn both short-term motion dynamics and longer-term activity patterns.
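A minimal sketch of how such sampling could look in a data loader; the clip length and skip range here are illustrative placeholders, not the values used in the paper.

import random

def sample_clip_indices(num_frames, clip_len=8, max_skip=8):
    # Sample frame indices with a random time skip between consecutive frames,
    # so one training clip may span anything from a fraction of a second to
    # several seconds of the original video.
    # Assumes the video has at least clip_len * max_skip frames.
    skip = random.randint(1, max_skip)
    start = random.randint(0, max(0, num_frames - clip_len * skip))
    return [start + i * skip for i in range(clip_len)]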

Sequence-Level Training: Model the entire sequence of motion by applying the loss over each prefix of frames.
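One way to realize this is to compute the prediction loss on every prefix of the generated sequence and average the results, so the model is supervised on rollouts of every length up to the full sequence. A minimal sketch, with MSE as a stand-in for the actual objective:

import torch
import torch.nn.functional as F

def sequence_prefix_loss(pred_frames, target_frames):
    # pred_frames, target_frames: (T, C, H, W) tensors.
    # Prefix 1 covers frame 0, prefix 2 covers frames 0..1, and so on,
    # so every rollout length up to T contributes to the loss.
    T = pred_frames.shape[0]
    losses = [F.mse_loss(pred_frames[:t], target_frames[:t]) for t in range(1, T + 1)]
    return torch.stack(losses).mean()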

Action Embeddings: Whole-body motion is high-dimensional, so we concatenate all actions at time t into a 1D tensor and use it to condition each AdaLN layer.
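A minimal sketch of that conditioning path, with a linear projection from the flattened action vector to the per-channel scale and shift of an adaptive LayerNorm; the layer sizes are placeholders, not the paper's exact configuration.

import torch
import torch.nn as nn

class ActionAdaLN(nn.Module):
    # Adaptive LayerNorm conditioned on a whole-body action vector: the action
    # at time t (root translation/rotation plus joint rotations, flattened into
    # one 1D tensor) is projected to a per-channel scale and shift.
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(action_dim, 2 * hidden_dim)

    def forward(self, x, action):
        # x: (batch, tokens, hidden_dim); action: (batch, action_dim)
        scale, shift = self.to_scale_shift(action).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)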

Quantitative Results

Atomic Action Performance


Comparison of models in generating videos of atomic actions.

Baselines


Baseline Perceptual Metrics.

Video Quality


Video Quality Across Time (FID).

Scaling


PEVA scales well: larger models lead to better performance.

BibTeX

@misc{bai2025wholebodyconditionedegocentricvideo,
  title={Whole-Body Conditioned Egocentric Video Prediction},
  author={Yutong Bai and Danny Tran and Amir Bar and Yann LeCun and Trevor Darrell and Jitendra Malik},
  year={2025},
  eprint={2506.21552},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.21552},
}

Acknowledgements

The authors thank Rithwik Nukala for his help in annotating atomic actions. We thank Katerina Fragkiadaki, Philipp Krähenbühl, Bharath Hariharan, Guanya Shi, Shubham Tulsiani and Deva Ramanan for their useful suggestions and feedback on improving the paper; Jianbo Shi for the discussion regarding control theory; Yilun Du for the support on Diffusion Forcing; Brent Yi for his help with work related to human motion; and Alexei Efros for the discussions and debates regarding world models. This work is partially supported by the ONR MURI N00014-21-1-2801.

 