Highlights
(1) Breakthrough Resolution Capability: As the first model to achieve native high-quality 4K video generation, UltraGen seamlessly scales pre-trained low-resolution (≤720P) video diffusion models to 1080P/2K/4K, eliminating the "pseudo-high-resolution" limitation of traditional super-resolution pipelines. It delivers authentic, detail-rich content that outperforms state-of-the-art methods in both 1080P and 4K tasks.
(2) Efficient Hierarchical Dual-Branch Attention: UltraGen innovatively decouples full attention into local and global branches to overcome the quadratic computational bottleneck of diffusion transformers. The local branch focuses on fine-grained regional details, while the global branch ensures semantic consistency, all within a hierarchical design that avoids the O((T·H·W)²) complexity of conventional full-attention models (a rough token-count comparison follows this list).
(3) Computational Efficiency & Speed Advantage: With its spatially compressed global modeling and cross-window local attention, UltraGen achieves a 4.78× speedup for 4K video generation and 2.69× speedup for 1080P generation compared to the popular Wan-T2V-1.3B baseline. It makes high-resolution video training and inference feasible without excessive hardware costs.
(4) Superior Quality Across Metrics: In quantitative evaluations, UltraGen sets new standards: the lowest HD-FVD (214.12 for 1080P, 424.61 for 4K), indicating the closest high-resolution similarity to real videos; the highest HD-MSE (390.19 for 1080P, 386.01 for 4K) and HD-LPIPS (0.5455 for 1080P, 0.6450 for 4K), reflecting the richest fine-grained details; and strong temporal consistency (0.9827 for 1080P) together with the top CLIP scores among native high-resolution generators, ensuring prompt alignment and smooth frame transitions.
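To make the efficiency argument in (2) and (3) concrete, the rough sketch below counts query-key pairs for full spatio-temporal attention versus a windowed local branch plus a spatially pooled global branch. The latent grid size (derived from an assumed VAE stride of 8 and patch size of 2), the window size, and the pooling factor are illustrative assumptions rather than UltraGen's published configuration, and a pair-count ratio is only a loose proxy for the measured wall-clock speedups.

```python
# Rough attention-cost comparison: full attention vs. local + global branches.
# All sizes below are illustrative assumptions, not UltraGen's actual settings.

def attention_pairs_full(T, H, W):
    """Query-key pairs for full spatio-temporal attention over T*H*W tokens."""
    n = T * H * W
    return n * n  # quadratic in the total token count

def attention_pairs_dual(T, H, W, window=8, pool=4):
    """Pairs for a windowed local branch plus a spatially pooled global branch."""
    n = T * H * W
    local = n * (T * window * window)          # each token attends within its (T, w, w) window
    n_global = T * (H // pool) * (W // pool)   # global branch runs on spatially pooled tokens
    return local + n_global * n_global

# Hypothetical 4K latent grid: 3840x2160 pixels / VAE stride 8 / patch size 2 -> 240 x 135 tokens,
# with 21 latent frames (assumed values for illustration only).
T, H, W = 21, 135, 240
full = attention_pairs_full(T, H, W)
dual = attention_pairs_dual(T, H, W)
print(f"full attention pairs : {full:.3e}")
print(f"dual-branch pairs    : {dual:.3e}")
print(f"reduction factor     : {full / dual:.1f}x")  # pair count only, not measured speedup
```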
Abstract
Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (≤720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
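The global-local decomposition described above can be pictured with the minimal PyTorch sketch below: a local branch runs full attention inside small spatio-temporal windows, while a global branch runs attention over spatially average-pooled tokens and broadcasts the result back to the full grid. The module name, window and pooling sizes, the treatment of the temporal axis, and the fusion by simple addition are all assumptions made for illustration; UltraGen's actual architecture (including its hierarchical cross-window scheme) is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchAttention(nn.Module):
    """Illustrative global-local attention split over a video token grid (B, T, H, W, C)."""

    def __init__(self, dim, num_heads=8, window=8, pool=4):
        super().__init__()
        self.num_heads = num_heads
        self.window = window      # spatial window size of the local branch
        self.pool = pool          # spatial pooling factor of the global branch
        self.qkv_local = nn.Linear(dim, dim * 3)
        self.qkv_global = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _attend(self, qkv):
        # qkv: (B', N, 3*C) -> multi-head scaled dot-product attention -> (B', N, C)
        Bp, N, _ = qkv.shape
        q, k, v = (t.reshape(Bp, N, self.num_heads, -1).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(Bp, N, -1)

    def forward(self, x):
        B, T, H, W, C = x.shape
        w, p = self.window, self.pool

        # Local branch: full attention inside each (T, w, w) spatio-temporal window.
        xl = x.reshape(B, T, H // w, w, W // w, w, C)
        xl = xl.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, T * w * w, C)
        local = self._attend(self.qkv_local(xl))
        local = local.reshape(B, H // w, W // w, T, w, w, C)
        local = local.permute(0, 3, 1, 4, 2, 5, 6).reshape(B, T, H, W, C)

        # Global branch: attention over spatially average-pooled tokens,
        # then broadcast back to the full resolution.
        xg = x.permute(0, 4, 1, 2, 3)                       # (B, C, T, H, W)
        xg = F.avg_pool3d(xg, kernel_size=(1, p, p))        # (B, C, T, H/p, W/p)
        Hp, Wp = H // p, W // p
        xg = xg.reshape(B, C, T * Hp * Wp).transpose(1, 2)  # (B, T*Hp*Wp, C)
        glob = self._attend(self.qkv_global(xg))
        glob = glob.transpose(1, 2).reshape(B, C, T, Hp, Wp)
        glob = F.interpolate(glob, size=(T, H, W), mode="nearest")
        glob = glob.permute(0, 2, 3, 4, 1)                  # (B, T, H, W, C)

        # Fuse the two branches (simple addition here; the real model may differ).
        return self.proj(local + glob)


# Toy usage: a (B, T, H, W, C) latent grid whose spatial size is divisible by the window.
x = torch.randn(1, 4, 32, 32, 64)
y = DualBranchAttention(dim=64, num_heads=4, window=8, pool=4)(x)
print(y.shape)  # torch.Size([1, 4, 32, 32, 64])
```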
Framework
Overview of UltraGen, which decomposes full attention into a global attention branch for overall semantic consistency and a local attention branch for high-fidelity regional content, enabling efficient, high-resolution video generation.
Qualitative Comparison
Comparison with existing state-of-the-art methods on 1080P video generation. The red boxes highlight zoomed-in regions, where our model produces the clearest high-resolution videos with the most fine-grained details.
Quantitative Comparisons
Quantitative comparisons. Our UltraGen demonstrates superior high-quality HD video generation capabilities. Bold indicates the best performance, and * indicates the best performance among all non-SR (non-super-resolution) methods.