🧠 The Perceptual Straightening Hypothesis

Hénaff et al. (2019) hypothesized that the human visual system transforms non-linear spatial representations of naturally occurring videos into linear temporal trajectories, which can then be predicted by linear extrapolation. Loosely inspired by this idea, we transform DINO feature trajectories into a time-aware video embedding using a linearised auto-encoder.
[1] Perceptual straightening of natural videos. Olivier J. Hénaff, Robbe L. T. Goris and Eero P. Simoncelli. Nature Neuroscience, 2019.
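
To make the notion of "straightening" concrete, below is a minimal sketch of how the straightness of a feature trajectory can be quantified, assuming per-frame DINO features stacked into a (T, D) array. The curvature measure (mean angle between successive displacement vectors) follows Hénaff et al.; the function and variable names are purely illustrative.

    import numpy as np

    def trajectory_curvature(features: np.ndarray) -> float:
        """Mean angle (degrees) between successive displacement vectors of a
        (T, D) feature trajectory; 0 corresponds to a perfectly straight path."""
        diffs = np.diff(features, axis=0)                                  # (T-1, D) displacements
        diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)       # unit directions
        cos = np.sum(diffs[:-1] * diffs[1:], axis=1).clip(-1.0, 1.0)       # cos of turning angles
        return float(np.degrees(np.arccos(cos)).mean())

    # A straighter trajectory is easier to predict by linear extrapolation.
    rng = np.random.default_rng(0)
    curved = rng.normal(size=(16, 768))                                    # jagged, random trajectory
    straight = np.outer(np.linspace(0.0, 1.0, 16), rng.normal(size=768))   # points along one line
    print(trajectory_curvature(curved), trajectory_curvature(straight))    # high vs ~0 degrees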

๐Ÿ—๏ธ The Model: LiFT

Please play the video animation below to understand the LiFT model design.
  • LiFT is trained in an unsupervised manner by reconstructing the input feature sequence.
  • What does LiFT learn? We observe that it essentially learns a smooth approximation of the feature trajectory. Moreover, it learns distinct embeddings for temporally opposite actions such as opening vs. closing a door; see the qualitative results below and the sketch after this list.
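
For intuition, here is a minimal sketch of the reconstruction-based training idea, assuming frozen per-frame DINO features of shape (T, D). The layer sizes, module names, and the simple flatten-and-project encoder are placeholder assumptions rather than the exact LiFT architecture; see the animation above for the actual design.

    import torch
    import torch.nn as nn

    class LiFTSketch(nn.Module):
        """Toy auto-encoder over a sequence of per-frame DINO features: the
        (T, D) trajectory is encoded into one compact, time-aware embedding
        and decoded back, trained purely by reconstruction (unsupervised)."""
        def __init__(self, feat_dim=768, embed_dim=256, num_frames=16):
            super().__init__()
            self.num_frames, self.feat_dim = num_frames, feat_dim
            self.encoder = nn.Sequential(
                nn.Flatten(start_dim=1),                         # (B, T, D) -> (B, T*D)
                nn.Linear(num_frames * feat_dim, embed_dim),     # compact video embedding
            )
            self.decoder = nn.Linear(embed_dim, num_frames * feat_dim)

        def forward(self, feats):                                # feats: (B, T, D)
            z = self.encoder(feats)
            recon = self.decoder(z).view(-1, self.num_frames, self.feat_dim)
            return z, recon

    model = LiFTSketch()
    feats = torch.randn(4, 16, 768)                              # frozen DINO features for 4 clips
    z, recon = model(feats)
    loss = nn.functional.mse_loss(recon, feats)                  # reconstruction objective
    loss.backward()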

🎞️ The Chirality in Action Benchmark

We repurpose three existing datasets (SSv2, EPIC, Charades) to mine chiral actions and build a new benchmark that probes video embedding models for chirality. We search for temporally opposite verbs using ChatGPT and then group similar nouns together to construct chiral groups.
Evaluation protocol: For each chiral group, we compute video embeddings for the + and − samples and train a linear probe to distinguish them. The overall accuracy is averaged across all chiral groups.
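
A sketch of this protocol with synthetic data, using a scikit-learn logistic-regression classifier as a stand-in for the linear probe; the embedding dimension, number of groups, and train/test split are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def chiral_group_accuracy(train_X, train_y, test_X, test_y):
        """Fit a linear probe on embeddings of one chiral group (+ vs -) and
        return its held-out accuracy."""
        probe = LogisticRegression(max_iter=1000).fit(train_X, train_y)
        return probe.score(test_X, test_y)

    rng = np.random.default_rng(0)
    accs = []
    for _ in range(3):                             # e.g. opening/closing a door, picking up/putting down, ...
        X = rng.normal(size=(200, 256))            # video embeddings for one chiral group
        y = rng.integers(0, 2, size=200)           # + / - labels within the group
        accs.append(chiral_group_accuracy(X[:150], y[:150], X[150:], y[150:]))
    print("CiA accuracy:", float(np.mean(accs)))   # average over all chiral groups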

🗒️ Highlight Results

We evaluate LiFT on the CiA benchmark as well as on standard action recognition benchmarks. First, we show that LiFT embeddings are time-sensitive (chiral-sensitive) and compact, even outperforming much larger video models such as VideoJEPA and VideoMAE. Second, we show that LiFT encodes temporal information that is likely complementary to that of existing video models such as VideoJEPA: concatenating LiFT embeddings with VideoJEPA embeddings yields performance gains on standard action recognition benchmarks (a sketch of this fusion is given below).
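
As an illustration of the fusion behind the complementarity result, the sketch below simply concatenates the two embeddings and trains a linear probe for action classification; the embedding dimensions, synthetic data, and scikit-learn probe are placeholder assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, n_classes = 500, 10
    vjepa = rng.normal(size=(n, 1024))             # placeholder VideoJEPA clip embeddings
    lift = rng.normal(size=(n, 256))               # placeholder (compact) LiFT clip embeddings
    labels = rng.integers(0, n_classes, size=n)    # action labels

    fused = np.concatenate([vjepa, lift], axis=1)  # simple feature-level concatenation
    probe = LogisticRegression(max_iter=1000).fit(fused[:400], labels[:400])
    print("probe accuracy on fused embeddings:", probe.score(fused[400:], labels[400:]))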

🙏 Acknowledgements

  • We thank Ashish Thandavan for support with infrastructure, and Sindhu Hegde and Makarand Tapaswi for useful discussions.
  • This research is funded by the EPSRC Programme Grant VisualAI EP/T028572/1 and a Royal Society Research Professorship RSRP\R\241003.

📜 Citation

If you find this work useful, please consider citing:

    @article{bagad2025chirality,
      title   = {Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening},
      author  = {Bagad, Piyush and Zisserman, Andrew},
      journal = {arXiv preprint arXiv:2509.08502},
      year    = {2025}
    }

    @inproceedings{Bagad25,
      author    = {Piyush Bagad and Andrew Zisserman},
      title     = {Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening},
      booktitle = {NeurIPS},
      year      = {2025}
    }

📙 Related Work