NeurIPS 2025
Piyush Bagad, Andrew Zisserman
University of Oxford
LiFT learns time-aware video representations that can linearly separate temporally opposite (chiral) actions like "opening" vs "closing" or "moving up" vs "moving down".
Key observation: tSNE projections of per-frame features from DINOv2 show that they lie on a time-sensitive trajectory. Can we use these to learn a time-aware video representation?
Inspired by the perceptual straightening hypothesis [Hénaff et al., Nature Neuroscience 2019], LiFT transforms these non-linear DINO trajectories into a compact video embedding with a linearized auto-encoder.
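To give a flavour of what "linearized auto-encoder" means here, below is a conceptual toy of latent straightening. It is an illustrative sketch only, not the released LiFT architecture or training objective; the layer shapes and the endpoint-interpolation penalty are placeholder choices.

```python
# Conceptual toy of latent straightening (NOT the actual LiFT model or loss).
# Encode a trajectory of per-frame features and penalize latents that stray from
# the straight line joining the first and last latent, plus a reconstruction term.
import torch
import torch.nn as nn

class ToyStraighteningAE(nn.Module):
    def __init__(self, feat_dim=384, latent_dim=64):  # placeholder sizes
        super().__init__()
        self.enc = nn.Linear(feat_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, feat_dim)

    def forward(self, x):  # x: [T, feat_dim] per-frame features
        z = self.enc(x)                                # [T, latent_dim] latent trajectory
        t = torch.linspace(0, 1, x.shape[0], device=x.device).unsqueeze(1)
        z_line = (1 - t) * z[0] + t * z[-1]            # straight line between endpoints
        straightness = ((z - z_line) ** 2).mean()      # how far the trajectory bends
        recon = ((self.dec(z) - x) ** 2).mean()        # reconstruction error
        return recon + straightness
```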
What we contribute:
- Model: LiFT - a compact (768-dim) time-aware video embedding trained in an unsupervised manner
- Benchmark: Chirality in Action (CiA) - a new benchmark built from SSv2, EPIC, and Charades datasets to evaluate temporal understanding
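The CiA benchmark evaluates exactly this kind of separation. As a concrete illustration of what "linearly separate" means in practice, the sketch below fits a linear probe on pre-computed LiFT embeddings for a single chiral pair; the file names and label convention are hypothetical, and this is not the official CiA evaluation script.

```python
# Hypothetical linear probe on one chiral pair, e.g. "opening" vs "closing".
# Assumes X is an [N, 768] array of LiFT embeddings and y holds binary labels
# (0 = "opening", 1 = "closing"); the .npy file names are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("lift_embeddings.npy")
y = np.load("chiral_labels.npy")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Linear-probe accuracy: {probe.score(X_test, y_test):.3f}")
```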
First, create a conda environment:
```bash
conda create --name lift python=3.11 -y
conda activate lift
```

Then, install the LiFT package:

```bash
pip install git+https://github.com/bpiyush/LiFT.git
```
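To quickly sanity-check the install (this assumes the package exposes a top-level `lift` module, as used in the quick-start example below):

```bash
python -c "import lift; print('LiFT installed at', lift.__file__)"
```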
Alternative: Manual installation with conda

If you prefer more control over dependencies, create a conda environment:
```bash
conda create --name lift python=3.11 -y
conda activate lift

# Install torch
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# Install lightning
pip install lightning==2.4.0

# Install other dependencies
pip install einops==0.8.1
pip install timm==1.0.22
pip install decord==0.6.0
pip install matplotlib==3.9.2
pip install opencv-python pandas ipdb ipywidgets tqdm scikit-learn termcolor seaborn ffmpeg-python

# Install gdown for downloading model weights
pip install gdown
```

Download the pre-trained LiFT model weights (~110MB):
```bash
# Download the checkpoint file
gdown 1DFapOrZwRcltyq3_tQNTQ9mHtpgKqtZY -O ggwirp95-epoch=458-step=834003.ckpt
```

Alternatively, you can manually download from Google Drive.
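If you want to verify the download before running anything, an optional check is to load the checkpoint on CPU and inspect its top-level keys; the exact keys are an assumption based on standard Lightning checkpoints.

```python
# Optional sanity check on the downloaded checkpoint (assumes a standard Lightning .ckpt layout).
import torch

# weights_only=False is needed because Lightning checkpoints store metadata beyond raw tensors;
# only do this for files you trust, such as this one downloaded from the authors' link.
ckpt = torch.load("ggwirp95-epoch=458-step=834003.ckpt", map_location="cpu", weights_only=False)
print(sorted(ckpt.keys()))                        # typically includes 'state_dict', 'epoch', ...
print(len(ckpt.get("state_dict", {})), "tensors in state_dict")
```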
```python
# Set path to your video
video_path = "your_video.mp4"

import torch

from lift import DINOv2ForVideo, make_classification_eval_transform, load_lift_module
from lift.dinov2 import compute_dino_features_for_single_video
from lift.demo import compute_lift_embeddings
from lift.viz_utils import show_trajectory_with_reconstruction

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load models
backbone = DINOv2ForVideo(model_id='vit_small_patch14_reg4_dinov2.lvd142m').to(device)
preprocess = make_classification_eval_transform()
lift_model = load_lift_module(ckpt_root=".", ckpt_name="ggwirp95-epoch=458-step=834003.ckpt").to(device)

# Extract features from your video
frames, _, dino_feats = compute_dino_features_for_single_video(
    video_path, preprocess, backbone, return_frames=True, device=device, n_frames=16
)

# Get LiFT embedding (768-dim time-aware video representation)
lift_output = compute_lift_embeddings(dino_feats.unsqueeze(0), lift_model, device=device)
embedding = lift_output["concat"]  # Shape: [1, 768]

# Visualize tSNE (DINO trajectory in red, LiFT reconstruction in blue)
img = show_trajectory_with_reconstruction(
    video_path=video_path,
    x=dino_feats,
    x_hat=lift_output["reconstructed"].squeeze(0),
    class_name="my video",
    method="tsne",
    joint_dimred=True,
    return_img=True,
)
img.save("lift_output.png")
```

Visualization of the DINO trajectory (red) and LiFT reconstruction (blue).
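As a quick follow-up experiment (not from the repository), you can compare the embedding of a clip with that of its time-reversed copy; a time-aware embedding should treat the reversed clip as noticeably different. The snippet continues from the quick-start code above, and the reversed file name is a placeholder.

```python
# Hypothetical follow-up: embed a time-reversed copy of the same clip and compare.
# "your_video_reversed.mp4" is a placeholder; e.g. create it with:
#   ffmpeg -i your_video.mp4 -vf reverse -an your_video_reversed.mp4
import torch.nn.functional as F

_, _, dino_feats_rev = compute_dino_features_for_single_video(
    "your_video_reversed.mp4", preprocess, backbone, return_frames=True, device=device, n_frames=16
)
lift_output_rev = compute_lift_embeddings(dino_feats_rev.unsqueeze(0), lift_model, device=device)

# Cosine similarity between the forward and reversed embeddings ([1, 768] each)
sim = F.cosine_similarity(lift_output["concat"], lift_output_rev["concat"]).item()
print(f"Cosine similarity (forward vs. time-reversed): {sim:.3f}")
```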
Alternative: Run the demo script
```bash
cd LiFT
export PYTHONPATH=$PWD
python lift/demo.py --ckpt_root . --ckpt_name ggwirp95-epoch=458-step=834003.ckpt
```

If you find this work useful, please consider citing:
```bibtex
@InProceedings{BagadLiFT25,
    author    = "Piyush Bagad and Andrew Zisserman",
    title     = "Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening",
    booktitle = "NeurIPS",
    year      = "2025",
}
```

Please also consider checking out the following papers:
- Seeing the Arrow of Time in Large Multimodal Models. NeurIPS (2025).
- Retro-Actions: Learning ‘Close’ by Time-Reversing ‘Open’ Videos. ICCVW (2019).
- Perceptual straightening of natural videos. Nature Neuroscience (2019).



