Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, Deepak Pathak
Corresponding author: Sandeep Routray
- [2025/10/13] ViPRA accepted for an Oral at the NeurIPS 2025 EWM Workshop.
- [2025/10/01] ViPRA accepted at the NeurIPS 2025 SpaVLE Workshop.
- A recipe to learn generalist robot policies from large-scale human and robot videos without action labels.
- A novel approach to extract motion-centric latent actions that capture fine-grained physical dynamics.
- A flow matching action decoder with action chunking for high-frequency continuous control.
- Outperforms prior latent action methods and VLA baselines trained on ground-truth actions.
The latent action model learns motion-centric abstract representations from actionless video. These latents capture fine-grained temporal dynamics and are discretized into tokens that serve as "latent actions" for downstream policy learning.
Key Features
- Actionless Learning: Learns from videos directly; no action annotations required.
- Motion-Centric: Focuses on fine-grained temporal dynamics rather than static appearance.
- Multi-Dataset: Trained on diverse human and robot data.
- Optical Flow Consistency: Uses optical flow for temporal consistency regularization.
Architecture
- Spatial Encoder: DINOv2-initialized vision transformer for spatial features.
- Spatio-Temporal Encoder: Non-causal transformer encoder over video clips.
- Vector Quantizer: Noise Substitution Vector Quantization (NSVQ) for discretizing latent actions.
- Spatio-Temporal Decoder: Causal transformer decoder for reconstruction.
- Flow Network: RAFT-based optical flow estimation for consistency loss.
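For intuition on the quantizer, here is a minimal PyTorch sketch of noise substitution vector quantization (NSVQ). It is not the repository's implementation; the class name, codebook size, and dimensions are placeholders. During training, the quantization residual is swapped for a random vector rescaled to the residual's norm, so gradients reach the encoder and codebook without a straight-through estimator; at inference, the hard codebook indices serve as the discrete latent action tokens.

```python
import torch
import torch.nn as nn

class NSVQ(nn.Module):
    """Minimal noise-substitution vector quantizer (illustrative, not the repository's code)."""

    def __init__(self, codebook_size: int = 256, dim: int = 32):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, tokens, dim) continuous latents from the spatio-temporal encoder.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        idx = dists.argmin(dim=-1)       # nearest codebook entry per token
        z_q = self.codebook(idx)         # hard-quantized latents
        if self.training:
            # NSVQ trick: replace the quantization residual with random noise rescaled
            # to the residual's norm, keeping the output differentiable w.r.t. both the
            # encoder output and the codebook (no straight-through estimator needed).
            noise = torch.randn_like(z)
            noise = noise / (noise.norm(dim=-1, keepdim=True) + 1e-8)
            z_q = z + (z - z_q).norm(dim=-1, keepdim=True) * noise
        return z_q, idx
```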
```bash
cd laq/
conda env create -f environment.yml -n laq
conda activate laq
```
Training configs live in laq/configs/config.py. Key parameters:
- Model: 768-dim transformer, 6 encoder layers, 8 decoder layers.
- Data: 224×224 crops, 8-frame sequences.
- Quantization: 32-dim latent space, NSVQ codebook.
- Losses: L1 reconstruction, LPIPS perceptual loss, optical-flow consistency loss.
- Training: ~300k steps, batch size 18, bf16 on 8×H200 GPUs, grad norm clip 6.0.
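As a rough orientation, these defaults might be grouped in laq/configs/config.py along the lines of the sketch below; the key names are hypothetical, so check the config file itself for the exact fields.

```python
# Hypothetical grouping of the key LAQ hyperparameters; the key names are
# illustrative -- see laq/configs/config.py for the real fields and defaults.
model = dict(embed_dim=768, encoder_layers=6, decoder_layers=8)
data = dict(crop_size=224, num_frames=8)
quantizer = dict(latent_dim=32)  # NSVQ codebook over 32-dim latents
train = dict(
    max_steps=300_000,
    batch_size=18,
    precision="bf16",
    grad_clip_norm=6.0,
    losses=("l1", "lpips", "flow_consistency"),
)
```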
The expected dataset layouts are shown below; you can match these layouts or extend laq/model/data.py to support your own datasets.
```
ssv2/
├── labels/
│   ├── train.json
│   ├── validation.json
│   └── test.json
└── 20bn-something-something-v2/
    ├── [video_id].webm
    └── ...
```
Example config:
```python
ssv2 = dict(
    root_dir=Path("/path/to/ssv2"),
    split="trainval",  # "train", "val", "trainval", "test", "all"
    stepsize=2,        # frame sampling stride
)
```
Trajectory-based robot datasets such as Bridge use the following layout:
```
dataset_name/
├── processed/
│   ├── trajectory_001/
│   │   └── images/
│   │       ├── 000000.jpg
│   │       ├── 000001.jpg
│   │       └── ...
│   ├── trajectory_002/
│   └── ...
```
Example config:
```python
bridge = dict(
    root_dir=Path("/path/to/bridge"),
    split="trainval",
    num_trajs=dict(trainval=25460, val=2546),
    stepsize=1,
)
```
LIBERO uses a per-suite layout:
```
LIBERO/
├── libero_10_modified/
│   └── images/trajectory_001/000000.jpg
├── libero_goal_modified/
│   └── images/...
├── libero_object_modified/
│   └── images/...
└── libero_spatial_modified/
    └── images/...
```
Example config:
```python
libero = dict(
    root_dir=Path("/path/to/LIBERO"),
    split="trainval",
    num_trajs=dict(trainval=1.0, val=0.1),  # float = percentage
    stepsize=1,
)
```
To add a custom dataset:
- Add a discovery function in laq/model/data.py:

  ```python
  def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
      # Return the list of frame directories / trajectories for the requested split.
      return list_of_paths
  ```

- Add your dataset case in VideoDatasetCoTrain.
- Add your config block to laq/configs/config.py (see the sketch below).
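For the last step, a hypothetical config block for a custom dataset could mirror the existing entries; the name `custom` and the values below are placeholders.

```python
# Hypothetical config block for a custom dataset; mirror the existing entries in
# laq/configs/config.py and adjust the path, split, and stride for your data.
custom = dict(
    root_dir=Path("/path/to/custom_dataset"),
    split="trainval",
    stepsize=1,  # frame sampling stride
)
```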
Launch training using the provided script, configured for bf16 training on a single node with 8 H200 GPUs:
```bash
bash run_train_laq.sh
```
To reproduce the codebook analysis and figures shown in the paper:
```bash
# Codebook usage analysis (reproduces codebook utilization figures)
python -m codebook_usage

# Rollout transfer evaluation (reproduces reconstruction and transfer results)
python -m rollout_transfer
```
To use the LAQ model to generate training data with latent actions for ViPRA policy pretraining, use the dataset-specific latent generation scripts:
```bash
# LIBERO
python -m inference.libero.libero_latent

# OpenX-style datasets (Fractal, BridgeData V2, Kuka)
python -m inference.openx.openx_latent --dataset bridge
python -m inference.openx.openx_latent --dataset kuka

# SSv2
python -m inference.ssv2.ssv2_latent
```
These scripts generate training data in JSONL format with multi-GPU processing and automatic shard merging. Each line contains a training sample with latent actions:
Sample JSONL Entry:
```json
{
  "instruction": "pick up the red block and place it in the blue bowl",
  "raw_action": [0.1, -0.2, 0.05, 0.0, 0.0, 0.0, 1.0],
  "image": ["libero_10_modified/images/traj_001/step0000.jpg", "libero_10_modified/images/traj_001/step0001.jpg"],
  "latent_state": ["libero_10_modified/images/traj_001/step0015.jpg"],
  "latent_action_idxs": [3, 7, 1, 4, 2, 6, 0, 5, 1, 3, 7, 2, 4, 0, 6, 1],
  "fields_la": "[instruction],[vision],latent_action",
  "fields_ls": "[instruction],[vision],latent_state",
  "fields_ls_la": "[instruction],[vision],latent_state,latent_action"
}
```
The ViPRA policy builds on a video-language foundation model, the Large World Model (LWM). We use LWM-Chat-1M-Jax as the base model and extend it with additional modules for latent action prediction and flow matching for continuous control.
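For background on the flow matching action decoder mentioned above, the sketch below illustrates the generic inference procedure: an action chunk is produced by integrating a learned velocity field from Gaussian noise, conditioned on the policy's latent features. This is a conceptual PyTorch illustration, not ViPRA's actual decoder; the network, conditioning dimension, chunk length, and step count are placeholders.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy conditional velocity field v(x_t, t, cond); a stand-in for the real decoder head."""

    def __init__(self, action_dim: int = 7, chunk: int = 16, cond_dim: int = 512):
        super().__init__()
        self.action_dim, self.chunk = action_dim, chunk
        self.net = nn.Sequential(
            nn.Linear(action_dim * chunk + cond_dim + 1, 512),
            nn.GELU(),
            nn.Linear(512, action_dim * chunk),
        )

    def forward(self, x, t, cond):
        # x: (B, chunk * action_dim) noisy action chunk, t: (B, 1) flow time, cond: (B, cond_dim)
        return self.net(torch.cat([x, t, cond], dim=-1))

@torch.no_grad()
def sample_action_chunk(v_field: VelocityField, cond: torch.Tensor, num_steps: int = 10):
    """Euler-integrate dx/dt = v(x, t, cond) from t=0 (Gaussian noise) to t=1 (action chunk)."""
    B = cond.shape[0]
    x = torch.randn(B, v_field.action_dim * v_field.chunk)  # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((B, 1), i * dt)
        x = x + dt * v_field(x, t, cond)                     # one Euler step along the flow
    return x.view(B, v_field.chunk, v_field.action_dim)      # chunked continuous actions

# Usage: `cond` would come from the policy's latent features for the current observation.
actions = sample_action_chunk(VelocityField(), cond=torch.randn(2, 512))
```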
```bash
cd vipra/
conda env create -f environment.yml -n vipra
conda activate vipra
```
Before training, download the VQ-GAN image tokenizer, text tokenizer, and pretrained model parameters from LWM-Chat-1M-Jax and place them under vipra/lwm/:
```bash
mkdir lwm
huggingface-cli download LargeWorldModel/LWM-Chat-1M-Jax --local-dir lwm/
```
We release a pre-tokenized, horizon-14 dynamics dataset on Hugging Face:
```bash
mkdir cotrain_data
huggingface-cli download vipra-project/cotrain-dynamics14 --local-dir cotrain_data/
```
cotrain-dynamics14 merges multiple robot datasets (LIBERO, BridgeData V2, Fractal, Kuka) with human video data from SSv2.
Each training sample includes:
- history frames
- latent state target
- latent action tokens from LAQ
- natural language task text
This dataset is already chunked into 14-step latent action sequences.
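To sanity-check the download, one sample can be inspected along the lines below; this assumes the shards are JSONL files with the same fields as the sample entry shown earlier, and the glob pattern is a placeholder.

```python
import json
from pathlib import Path

# Assumes JSONL shards with the same fields as the sample entry shown earlier;
# the glob below is a placeholder -- point it at the files under cotrain_data/.
shard = next(Path("cotrain_data").rglob("*.jsonl"))
with shard.open() as f:
    sample = json.loads(f.readline())

print(sample["instruction"])         # natural language task text
print(sample["latent_action_idxs"])  # latent action tokens from LAQ
print(sample["image"])               # history frames
```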
We also release a VQGAN vision cache on Hugging Face so you don't have to repeatedly tokenize raw pixels:
```bash
mkdir vision_cache
huggingface-cli download vipra-project/cotrain-vqgan-vision-cache --local-dir vision_cache/
```
This contains precomputed VQGAN token sequences for each frame, which can be used instead of running the image tokenizer online.
If you don't use the cache, set vqgan_path to the VQ-GAN weights from LWM-Chat-1M-Jax so ViPRA can tokenize frames on the fly.
Launch pretraining using the provided script (configured for 8×H200 GPUs):
```bash
cd vipra/
bash scripts/pretrain.sh
```
See vipra/scripts/pretrain.sh for full hyperparameters.
Download the pretrained checkpoint weights, VQ-GAN image tokenizer, and text tokenizer from Hugging Face:
```bash
cd vipra && mkdir vipra_checkpoints
huggingface-cli download vipra-project/vipra-7b-pretrained --local-dir vipra_checkpoints/
```
For task-specific finetuning, prepare your dataset in JSONL format, where each line represents a single timestep with the following structure:
```json
{
  "id": "ep00000/step0000",
  "image": "ep00000/step0000.png",
  "raw_action": [0.016, 0.0, -0.0, 0.0, 0.0, -0.0, -1.0],
  "proprio": [0.003, -0.141, 0.011, -2.431, ...],
  "instruction": "<s> You are a helpful assistant. USER: What action should the robot take to `put the white mug on the left plate` ASSISTANT:"
}
```
We provide a full data processing pipeline example (shown here with LIBERO Long):
Step 1: Action Discretization
```bash
python data/finetune_preprocess_libero.py \
    --input_path ./libero_10_raw.jsonl \
    --output_filename ./libero_10_quant.jsonl \
    --csv_filename ./quant_bins.csv \
    --discretize_bins 2047 \
    --task_name libero_10
```
Step 2: Dynamics Formatting (14-step horizon, history, proprio)
```bash
python data/dynamics14_libero.py \
    --input_jsonl ./libero_10_quant.jsonl \
    --data_root ./ \
    --csv_path ./quant_bins.csv \
    --horizon 14 \
    --action_type delta-eef \
    --task_name libero_10
```
Step 3: Action / Proprio Normalization
```bash
python data/normalize_libero.py \
    --raw_jsonl ./libero_10_raw.jsonl \
    --dynamics_jsonl ./libero_10_dynamics14_v2.jsonl \
    --output_jsonl ./libero_10_final.jsonl \
    --action_stats_json ./action_stats.json \
    --proprio_stats_json ./proprio_stats.json
```
To launch finetuning (LIBERO Long example):
```bash
cd vipra/
bash scripts/finetune_libero_long.sh
```
See vipra/scripts/finetune_libero_long.sh for full hyperparameters.
ViPRA uses a client–server architecture for deployment: a server that runs inference and a lightweight client that sends observations and receives actions.
Start the inference server:
```bash
cd vipra/
bash scripts/run_server.sh [GPU_ID] [PORT]

# Examples:
bash scripts/run_server.sh 0 8005
bash scripts/run_server.sh 1       # GPU 1, default port 8005
bash scripts/run_server.sh         # GPU 0, default port 8005
```
The server is configured by the ViPRAConfig class in vipra/inference/dynamics_action_cont_server.py.
Default endpoint: https://localhost:8005
The ViPRAClient class in vipra/inference/dynamics_action_cont_client.py provides a simple interface for communicating with the inference server and obtaining robot actions. The client can be customized for your particular use case and robot platform.
```python
from inference.dynamics_action_cont_client import ViPRAClient
import numpy as np

client = ViPRAClient(
    server_url="https://localhost:8005",
    timeout=(1.0, 5.0),
    image_size=256
)

task_description = "pick up the red block and place it in the blue bowl"
client.reset_policy(task_description)

image1 = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
image2 = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)

# Two request modes available:
actions = client.get_action([image1, image2], mode="json")   # JSON mode (baseline)
actions = client.get_action([image1, image2], mode="bytes")  # JPEG mode (faster)
```
API Endpoints
- POST /step – JSON payload with images in nested lists.
- POST /step_bytes – multipart form data with JPEG-compressed images (recommended).
- POST /reset – reset the policy and set a new task instruction.
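If you prefer to bypass ViPRAClient, the endpoints can be called directly with requests. A minimal sketch is below; the payload field names ("task", "images") are assumptions for illustration, so check vipra/inference/dynamics_action_cont_server.py for the exact request schema.

```python
import cv2
import numpy as np
import requests

SERVER = "https://localhost:8005"  # default endpoint; verify=False for a local self-signed cert

# NOTE: the field names "task" and "images" are illustrative assumptions; see
# vipra/inference/dynamics_action_cont_server.py for the exact request schema.
requests.post(f"{SERVER}/reset", json={"task": "pick up the red block"}, verify=False)

frame = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
ok, jpeg = cv2.imencode(".jpg", frame)

# /step_bytes takes JPEG-compressed images as multipart form data (recommended).
resp = requests.post(
    f"{SERVER}/step_bytes",
    files={"images": ("frame0.jpg", jpeg.tobytes(), "image/jpeg")},
    verify=False,
)
print(resp.json())  # predicted actions
```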
To set up the lightweight client environment:
```bash
conda env create -f client_environment.yml -n vipra-client
conda activate vipra-client
```
- Lightweight: only requests, OpenCV, numpy
- No JAX / PyTorch required
- Can run on edge devices, laptops, etc.
If you find our code or models useful in your work, please cite ViPRA:
```bibtex
@misc{routray2025vipra,
  title={ViPRA: Video Prediction for Robot Actions},
  author={Sandeep Routray and Hengkai Pan and Unnat Jain and Shikhar Bahl and Deepak Pathak},
  year={2025},
  eprint={2511.07732},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2511.07732},
}
```
ViPRA builds on LWM and LAPA. We thank the authors of these projects for open-sourcing their code and models.
ViPRA’s code and model weights are released under the Apache License 2.0.