LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Demo Videos
Minute-Length Videos
17-second Videos
Abstract
Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically with the number of pixels. This makes minute-length video generation extremely expensive and limits most existing models to videos that are only 10-20 seconds long.
We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly with the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity self-attention block with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch (see details below).
Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15\(\times\) and latency by up to 11.5\(\times\). Furthermore, both automatic metrics and human evaluation demonstrate that LinGen-4B delivers video quality comparable to state-of-the-art models (with 50.5%, 52.1%, and 49.1% win rates against Gen-3, LumaLabs, and Kling, respectively). This paves the way toward hour-length movie generation and real-time interactive video generation.
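To make the scaling gap concrete, the short sketch below compares how per-layer cost grows with token count for a quadratic self-attention layer versus a linear-complexity layer. This is an illustrative back-of-the-envelope estimate only; the hidden size, latent-grid shape, and state size are hypothetical and are not taken from the paper.

# Illustrative arithmetic only (not from the paper's code): how attention FLOPs
# scale with token count for a quadratic self-attention block versus a
# linear-complexity block. All sizes below are hypothetical.

def self_attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate FLOPs of one self-attention layer: O(N^2 * d)."""
    return 2 * num_tokens * num_tokens * dim

def linear_block_flops(num_tokens: int, dim: int, state: int = 128) -> int:
    """Approximate FLOPs of a linear-complexity (SSM-style) layer: O(N * d * s)."""
    return 2 * num_tokens * dim * state

dim = 3072  # hypothetical hidden size
for seconds in (17, 68, 204):
    # hypothetical latent grid: 32x32 spatial tokens, 4 latent frames per second
    n = 32 * 32 * 4 * seconds
    ratio = self_attention_flops(n, dim) / linear_block_flops(n, dim)
    print(f"{seconds:>3}s -> {n:,} tokens, quadratic/linear FLOPs ratio ~ {ratio:,.0f}x")

The takeaway is that the cost gap between the two layer types itself grows linearly with video length, which is why quadratic self-attention becomes the bottleneck for minute-length generation.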
Framework Overview
LinGen replaces self-attention layers with a MATE block, which inherits linear complexity from its two branches: the MA-branch and the TE-branch. The MA-branch targets short-to-long-range correlations by combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch introduces a novel Temporal Swin Attention block designed to capture correlations between spatially adjacent tokens and temporally medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos.
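The PyTorch sketch below illustrates how such a two-branch block might replace self-attention: a linear-complexity MA-branch plus a TE-branch that attends within non-overlapping temporal windows at each spatial location. It is a minimal sketch under stated assumptions, not LinGen's implementation: the class name MATEBlockSketch, all shapes, and the depthwise-convolution MA stand-in are hypothetical, and the bidirectional Mamba2, Rotary Major Scan, and review tokens are not reproduced here.

import torch
import torch.nn as nn


class MATEBlockSketch(nn.Module):
    """Hypothetical two-branch block: linear MA-branch + windowed temporal attention."""

    def __init__(self, dim: int, heads: int = 8, window: int = 8):
        super().__init__()
        self.window = window
        # MA-branch stand-in: depthwise conv + gating keeps cost linear in the
        # number of tokens (a real bidirectional Mamba2 block would go here).
        self.ma_norm = nn.LayerNorm(dim)
        self.ma_mix = nn.Conv1d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.ma_gate = nn.Linear(dim, dim)
        # TE-branch: attention restricted to non-overlapping temporal windows
        # over tokens at the same spatial location (Swin-style along time).
        self.te_norm = nn.LayerNorm(dim)
        self.te_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, dim) video latent tokens
        b, t, s, d = x.shape
        # --- MA-branch: linear-complexity mixing over the flattened sequence.
        seq = self.ma_norm(x).reshape(b, t * s, d)
        mixed = self.ma_mix(seq.transpose(1, 2)).transpose(1, 2)
        x = x + (mixed * torch.sigmoid(self.ma_gate(seq))).reshape(b, t, s, d)
        # --- TE-branch: attention within temporal windows, per spatial site.
        h = self.te_norm(x).permute(0, 2, 1, 3).reshape(b * s, t, d)
        pad = (-t) % self.window
        h = nn.functional.pad(h, (0, 0, 0, pad))
        w = h.reshape(-1, self.window, d)            # (b*s*num_windows, window, d)
        w, _ = self.te_attn(w, w, w)
        h = w.reshape(b * s, t + pad, d)[:, :t]
        return x + h.reshape(b, s, t, d).permute(0, 2, 1, 3)


tokens = torch.randn(1, 16, 32 * 32, 64)             # 16 frames, 32x32 latent tokens
print(MATEBlockSketch(64)(tokens).shape)             # torch.Size([1, 16, 1024, 64])

Because the MA-branch scans a flattened sequence and the TE-branch only attends within fixed-size temporal windows, both branches cost O(N) in the number of tokens, which is the property the MATE block relies on.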
Exceptional Efficiency with Linear Computational Complexity
Comparisons with Existing Works
State-of-the-Art Commercial Models
Prompt: A fish swimming into a coffee shop and trying to order.
LumaLabs
Runway Gen3
Kling 1.5
LinGen (Ours)
Prompt: Camera zoom in. A chef chopping vegetables with speed.
LumaLabs
Runway Gen3
Kling 1.5
LinGen (Ours)
Typical Open-Source Models
Prompt: A dog wearing VR goggles on a boat.
T2V-Turbo
LinGen (Ours)
Prompt: Elderly artist painting by the sea.
CogVideoX-5B
LinGen (Ours)
Prompt: Noir street: neon, shadows, solitary walker.
OpenSora V1.2
LinGen (Ours)
Minute-Length Trials
Prompt: Aerial view of Santorini during the blue hour.
Loong
LinGen (Ours)*
*LinGen supports multiple aspect ratios; here we follow the baseline's setup and generate square videos.
PA-VDM does not provide its prompts, so we show a similar video generated by LinGen.
PA-VDM
LinGen (Ours)
Ablation Experiments
After 30K Pre-Training Steps at 256p Resolution and 17-Second Length
LinGen w/o TESA and RMS
LinGen w/o RMS
LinGen
After 2K Pre-Training Steps at 512p Resolution and 68-Second Length
LinGen w/o review tokens
LinGen w/ review tokens
Showing a Failure Case in which Consistency is Abnormally Bad at 256p Resolution
LinGen w/o Hybrid Training
LinGen w/ Hybrid Training
Showing a Failure Case in which Quality is Abnormally Bad at 512p Resolution
LinGen w/o Quality-Tuning
LinGen w/ Quality-Tuning
BibTeX
@inproceedings{wang2025lingen,
  title={LinGen: Towards high-resolution minute-length text-to-video generation with linear computational complexity},
  author={Wang, Hongjie and Ma, Chih-Yao and Liu, Yen-Cheng and Hou, Ji and Xu, Tao and Wang, Jialiang and Juefei-Xu, Felix and Luo, Yaqiao and Zhang, Peizhao and Hou, Tingbo and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={2578--2588},
  year={2025}
}
@article{wang2025lingenuni,
  title={LinGen-Uni: A Universal Linear-Complexity Framework for High-Resolution Minute-Length Text-to-Video Generation},
  author={Wang, Hongjie and Ma, Chih-Yao and Liu, Yen-Cheng and Hou, Ji and Xu, Tao and Wang, Jialiang and Juefei-Xu, Felix and Luo, Yaqiao and Zhang, Peizhao and Hou, Tingbo and others},
  journal={Research Square Preprint},
  year={2025}
}