Enhancing long video understanding via extreme compression by progressively reducing each selected frame to a single token.
TLDR
Progressively compress video tokens down to one token per frame to achieve more comprehensive long video understanding.
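To make "progressively compress to one token per frame" concrete, here is a toy sketch in plain Python. It is not the XComp implementation: the pairwise-averaging merge rule, the halving schedule, and the function names `halve_tokens` and `compress_to_one_token` are all assumptions chosen only to illustrate how a frame's token count can shrink stage by stage until a single token remains.

```python
def halve_tokens(frame_tokens):
    """Merge adjacent token pairs by averaging (illustrative rule, not XComp's).

    An odd trailing token is carried over unchanged.
    """
    merged = []
    for i in range(0, len(frame_tokens) - 1, 2):
        a, b = frame_tokens[i], frame_tokens[i + 1]
        merged.append([(x + y) / 2 for x, y in zip(a, b)])
    if len(frame_tokens) % 2 == 1:
        merged.append(list(frame_tokens[-1]))
    return merged


def compress_to_one_token(frame_tokens):
    """Halve the token count repeatedly until one token is left.

    Returns the final token and the token count after every stage.
    """
    if not frame_tokens:
        raise ValueError("a frame needs at least one token")
    schedule = [len(frame_tokens)]
    while len(frame_tokens) > 1:
        frame_tokens = halve_tokens(frame_tokens)
        schedule.append(len(frame_tokens))
    return frame_tokens[0], schedule


if __name__ == "__main__":
    # One hypothetical frame with 16 two-dimensional tokens.
    frame = [[float(i), 2.0 * i] for i in range(16)]
    token, schedule = compress_to_one_token(frame)
    print(schedule)  # token count per stage
    print(token)     # the single remaining token for this frame
```

In this toy version every stage is a fixed average. It only shows the shape of a progressive schedule and says nothing about how XComp itself merges tokens.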
Experiment
XComp is fine-tuned from VideoChat-Flash-2B and uses the same environment and data. Please refer to VideoChat-Flash for installation and data preparation.
Training: ./llava-train_videochat
Evaluation: ./lmms-eval_videochat
Download the model parameters from Google Drive and save them to XComp/llava-train_videochat/checkpoints/baseline_1000frame_cos/stagesuf-umt-hd-large-tome16_mlp_hd64_Qwen2_5_1_5B_stage3_short-long_mix_sft_mid2.yaml/
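The checkpoint path is long and easy to mistype, so a small helper can build and create it before you copy the downloaded files in. This is only a convenience sketch: the helper names `checkpoint_dir` and `ensure_checkpoint_dir` are hypothetical, and it assumes the repository is cloned into a folder named `XComp` under the current working directory.

```python
from pathlib import Path

# Sub-path copied verbatim from the instructions above. Note that the last
# component ends in ".yaml" but is used as a directory.
CHECKPOINT_SUBDIR = Path(
    "llava-train_videochat/checkpoints/baseline_1000frame_cos/"
    "stagesuf-umt-hd-large-tome16_mlp_hd64_Qwen2_5_1_5B_stage3_short-long_mix_sft_mid2.yaml"
)


def checkpoint_dir(repo_root="XComp"):
    """Return the expected checkpoint directory (repo_root is an assumption)."""
    return Path(repo_root) / CHECKPOINT_SUBDIR


def ensure_checkpoint_dir(repo_root="XComp"):
    """Create the checkpoint directory if it does not exist yet and return it."""
    target = checkpoint_dir(repo_root)
    target.mkdir(parents=True, exist_ok=True)
    return target


if __name__ == "__main__":
    print(ensure_checkpoint_dir())
```

After running it, place the files downloaded from Google Drive inside the printed directory.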
Citation
@inproceedings{zhang2025one,
  title={One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding},
  author={Zheyu Aqa Zhang and Ziqi Pang and Shixing Chen and Xiang Hao and Vimal Bhat and Yu-Xiong Wang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=bythzT0b81}
}
Acknowledgement
This work was supported in part by Amazon, NSF under Grants 2106825 and 2519216, and the DARPA Young Faculty Award. This work used computational resources, including Amazon Web Services (AWS), and the NCSA Delta and DeltaAI supercomputers through allocation CIS230012 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program.
We gratefully acknowledge the open-source projects that form the foundation of XComp: VideoChat-Flash, Qwen, and LLaVA-Video.
We also thank the following related open-source projects: UMT, lmms-eval, transformers, ToMe, PyramidDrop, LongVideoBench, MLVU, VideoMME, and LVBench.
About
[NeurIPS 2025] One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding