[NeurIPS 2025] One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

📄 PDF: neurips_camera_ready_1_token_per_frame_compression.pdf
🎥 SlidesLive: Link to Presentation

Enhancing long video understanding via extreme compression by progressively reducing each selected frame to a single token.

TLDR

Progressively compress video tokens down to one token per frame to achieve more comprehensive long video understanding.
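To make the TLDR concrete, here is a minimal toy sketch of what "progressive" reduction to one token per frame can look like. It is not the XComp implementation: averaging adjacent tokens is a stand-in chosen for simplicity, and the names `merge_pairs` and `compress_frame`, the 16-token frame, and the 2-dimensional features are all hypothetical.

```python
from typing import List, Tuple

Token = List[float]  # one visual token, represented as a plain feature vector


def merge_pairs(tokens: List[Token]) -> List[Token]:
    """Average adjacent token pairs, roughly halving the token count."""
    merged = [
        [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        for i in range(0, len(tokens) - 1, 2)
    ]
    if len(tokens) % 2 == 1:
        merged.append(list(tokens[-1]))  # odd leftover token is carried over
    return merged


def compress_frame(tokens: List[Token]) -> Tuple[Token, List[int]]:
    """Repeat pair merging until one token is left; also return the count per stage."""
    if not tokens:
        raise ValueError("a frame needs at least one token")
    schedule = [len(tokens)]
    while len(tokens) > 1:
        tokens = merge_pairs(tokens)
        schedule.append(len(tokens))
    return tokens[0], schedule


# Hypothetical frame: 16 tokens with 2 features each (sizes chosen for illustration).
frame = [[float(i), float(2 * i)] for i in range(16)]
final_token, schedule = compress_frame(frame)
print(schedule)     # [16, 8, 4, 2, 1]
print(final_token)  # [7.5, 15.0]
```

The point of the sketch is only the shape of the process: the token count shrinks in stages rather than in one step, and each frame ends up as a single vector.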

Experiment

XComp is fine-tuned from VideoChat-Flash-2B, so the environment and the data are the same. Please refer to VideoChat-Flash for installation and data preparation.

Training: ./llava-train_videochat
Evaluation: ./lmms-eval_videochat

Download the model parameters from Google Drive and save them to XComp/llava-train_videochat/checkpoints/baseline_1000frame_cos/stagesuf-umt-hd-large-tome16_mlp_hd64_Qwen2_5_1_5B_stage3_short-long_mix_sft_mid2.yaml/
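Because the target directory is deeply nested and its last component ends in `.yaml`, it can help to create it programmatically before copying the downloaded files in. The following is a small sketch, assuming the script is run from the directory that contains the `XComp` checkout; the helper name `prepare_checkpoint_dir` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directory named in the instructions above. The final component ends in ".yaml"
# but is used as a directory, exactly as the path is written there.
CHECKPOINT_DIR = (
    Path("XComp/llava-train_videochat/checkpoints/baseline_1000frame_cos")
    / "stagesuf-umt-hd-large-tome16_mlp_hd64_Qwen2_5_1_5B_stage3_short-long_mix_sft_mid2.yaml"
)


def prepare_checkpoint_dir(root: Path = Path(".")) -> Path:
    """Create the checkpoint directory under `root` (hypothetical helper)."""
    target = root / CHECKPOINT_DIR
    target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return target


if __name__ == "__main__":
    target = prepare_checkpoint_dir()
    print(f"Place the downloaded model files in: {target}")
```

Passing a different `root` lets you point the helper at wherever the checkout lives, without editing the nested path itself.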

Citation

@inproceedings{
zhang2025one,
title={One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding},
author={Zheyu Aqa Zhang and Ziqi Pang and Shixing Chen and Xiang Hao and Vimal Bhat and Yu-Xiong Wang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=bythzT0b81}
}

Acknowledgement

This work was supported in part by Amazon, NSF under Grants 2106825 and 2519216, and the DARPA Young Faculty Award. This work used computational resources, including Amazon Web Services (AWS), and the NCSA Delta and DeltaAI supercomputers through allocation CIS230012 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program.

We gratefully acknowledge the open-source projects that form the foundation of XComp: VideoChat-Flash, Qwen, and LLaVA-Video.

We also thank the following related open-source projects: UMT, lmms-eval, transformers, ToMe, PyramidDrop, LongVideoBench, MLVU, VideoMME, and LVBench.
