Sparse VideoGen:
Accelerating Video Generation with
Spatial-Temporal Sparse Attention by 2x with High Pixel Fidelity
Accepted by ICML 2025
TL;DR
Announcing Sparse VideoGen, a training-free method that accelerates video DiTs by 2× while preserving high pixel fidelity (PSNR = 29). The secret? Unleashing spatial and temporal sparsity in 3D Full Attention. Dive into our paper and code to see the magic.
Overview
In the field of video generation, the latest and best-performing Video Diffusion Transformer models all
employ 3D Full Attention. However, their substantial computational demands pose significant challenges for
real-world applications. For example, HunyuanVideo takes 30 minutes to generate a 5-second video on
1×H100, which is prohibitively time-consuming due to the O(n^2) computation of 3D Full Attention.
To speed up their inference, we introduce Sparse VideoGen (SVG), a training-free
framework that leverages inherent spatial and temporal sparsity in the 3D Full Attention
operations. Sparse VideoGen's core contributions include
- Identifying the spatial and temporal sparsity patterns in video diffusion models.
- Proposing an Online Profiling Strategy to dynamically identify these patterns.
- Implementing an end-to-end generation framework through efficient algorithm-system co-design, with a hardware-efficient layout transformation and customized kernels.
3D Full Attention is Extremely Slow
State-of-the-art Video DiT models (such as HunyuanVideo,
CogVideoX) adopt a 3D Full Attention
mechanism to capture complex spatial and temporal dependencies in video data. This approach offers better
generation quality than the factorized 2D (spatial) + 1D (temporal) attention approach.
However, since the computational complexity of
Attention increases quadratically with the context length, the inference time becomes excessively long.
For example, in HunyuanVideo it takes 29 minutes to generate a 5-second, 720p video on a single H100 GPU,
with attention operations consuming over 80% of the runtime. This computational bottleneck motivated us to
explore efficient attention mechanisms to accelerate DiTs while maintaining high generation quality and
pixel fidelity.
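As a rough back-of-the-envelope illustration (with assumed latent dimensions, not HunyuanVideo's exact configuration), the sketch below shows how quickly the quadratic term grows once a video is tokenized into roughly a hundred thousand tokens:

```python
# Back-of-the-envelope attention cost; the latent grid below is a hypothetical
# example (VAE + patchify output), not HunyuanVideo's exact configuration.
frames, height, width = 32, 45, 80            # assumed latent grid for a 5-second, 720p video
n = frames * height * width                   # ~115k tokens in one sequence
head_dim = 128                                # assumed per-head dimension
attn_flops = 4 * n * n * head_dim             # ~2*n^2*d for QK^T plus ~2*n^2*d for A@V, per head
print(f"tokens: {n:,}, attention FLOPs per head: {attn_flops:.2e}")
```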
Unveiling Inherent Sparsity in 3D Full Attention
We observed two distinct sparsity patterns emerging in video diffusion's attention maps: spatial sparsity and temporal sparsity. We also found that most attention heads can be distinctly classified into one of these two categories. In HunyuanVideo, 29.2% of attention heads are spatial, 66.7% are temporal, and only 4.1% are ambiguous. Check our demo to understand our findings!
Spatial Heads focus on spatially local tokens
The Spatial Head focuses on spatially local tokens within the same frame and adjacent frames, resulting in a block-wise layout of the attention map. Since pixels in a single frame are tokenized into contiguous sequences, the Spatial Head attends to tokens corresponding to neighboring pixels, making the attention mask concentrate around the main diagonal. The Spatial Head is essential for maintaining video quality and spatial consistency in generated videos.
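A minimal PyTorch sketch of this block-wise pattern, assuming a simplified frame-major token layout and hypothetical frame/token counts (not SVG's actual mask construction):

```python
import torch

def spatial_mask(num_frames: int, tokens_per_frame: int, frame_window: int = 1) -> torch.Tensor:
    """Block-wise mask: query i may attend to key j only if their frames are
    within `frame_window` of each other, concentrating mass near the diagonal."""
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame            # frame index of each token
    dist = (frame_id[:, None] - frame_id[None, :]).abs()      # |frame_i - frame_j|
    return dist <= frame_window                               # True = attend

print(spatial_mask(num_frames=4, tokens_per_frame=3).int())   # block-diagonal 12x12 mask
```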
Temporal Heads focus on the same tokens across frames
The Temporal Head captures relationships between tokens across different frames, facilitating the modeling of temporal dependencies. Its attention map exhibits a slash-wise layout with a constant stride, targeting tokens at the same spatial location across frames. This mechanism is crucial for ensuring temporal consistency in the generated video sequences.
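Analogously, a minimal sketch of the slash-wise temporal pattern under the same simplified layout; the stride equals the number of tokens per frame, and all sizes are again hypothetical:

```python
import torch

def temporal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Slash-wise mask: query i may attend to key j only if both sit at the same
    spatial offset inside their frames (columns spaced tokens_per_frame apart)."""
    n = num_frames * tokens_per_frame
    offset = torch.arange(n) % tokens_per_frame               # spatial offset of each token
    return offset[:, None] == offset[None, :]                 # same offset -> attend

print(temporal_mask(num_frames=4, tokens_per_frame=3).int())  # diagonals with stride 3
```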
Oracle selection of sparsity patterns for each head
While the Spatial and Temporal Heads individually address spatial and temporal consistency, their optimal combination is essential for achieving lossless performance in video generation. Assigning each attention head the attention mask that yields the lower mean squared error (MSE) achieves a PSNR > 28 dB, indicating near-lossless performance.
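To make the oracle concrete, here is a hedged single-head sketch of the selection rule: compute full attention once as the reference, compute each masked variant, and keep whichever mask yields the lower MSE. The masked_attention helper and tensor shapes are illustrative, not SVG's actual kernels.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    # Illustrative helper. q: [m, d]; k, v: [n, d]; mask: [m, n] boolean (True = attend).
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

def oracle_pattern(q, k, v, spatial, temporal):
    """Per-head oracle: pick the mask whose output is closer to full attention."""
    full = masked_attention(q, k, v, torch.ones_like(spatial))    # reference output
    mse_spatial = F.mse_loss(masked_attention(q, k, v, spatial), full)
    mse_temporal = F.mse_loss(masked_attention(q, k, v, temporal), full)
    return "spatial" if mse_spatial < mse_temporal else "temporal"
```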
Achieving High-Fidelity Compression with Our Online Profiling Strategy
Our next question is: how do we efficiently select the appropriate sparsity pattern for each head? The
theoretically lossless oracle above does not lead to real speedup, since it requires computing
full attention as a reference. The challenge is that the oracle sparsity pattern is not
static; it varies across layers and denoising steps. This dynamic
nature necessitates an adaptive
and efficient method to determine the sparsity pattern on-the-fly.
To address this challenge, Sparse VideoGen proposes an Online Profiling Strategy to
dynamically identify and exploit these sparse attention patterns with minimal overhead.
This strategy samples a subset of query tokens and determines the most appropriate
sparsity pattern for each head based on the MSE computed on these sampled query tokens.
We find that a very small number of query tokens (64 out of 120k) is sufficient to accurately
predict the optimal sparsity pattern. Because so few query tokens are processed, the overhead of the
Online Profiling Strategy is negligible.
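A minimal sketch of the profiling idea on simplified single-head tensors: sample a few query rows, compute their exact attention output as a reference, and compare it against the spatially and temporally masked outputs restricted to the same rows. The helper, shapes, and sampling scheme are illustrative; only the sample count of 64 mirrors the number quoted above.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    # Same illustrative helper as in the oracle sketch above.
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

def profile_head(q, k, v, spatial, temporal, num_samples: int = 64):
    """Choose a sparsity pattern for one head using only `num_samples` sampled queries."""
    idx = torch.randint(0, q.shape[0], (num_samples,))             # sampled query rows
    ref = masked_attention(q[idx], k, v,
                           torch.ones(num_samples, k.shape[0], dtype=torch.bool))
    mse_spatial = F.mse_loss(masked_attention(q[idx], k, v, spatial[idx]), ref)
    mse_temporal = F.mse_loss(masked_attention(q[idx], k, v, temporal[idx]), ref)
    return "spatial" if mse_spatial < mse_temporal else "temporal"
```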
Hardware-Efficient Layout Transformation Enables Theoretical Speedup
While exploiting spatial and temporal sparsity improves attention efficiency, a key challenge arises
from the non-contiguous memory access patterns inherent in temporal attention. Recall
that temporal heads
require accessing tokens at the same spatial position across multiple frames, resulting in an attention
mask composed of multiple thin, slash-wise patterns.
However, these tokens are often scattered in memory due to the conventional frame-wise token arrangement.
Such fragmented memory access leads to suboptimal utilization of GPUs, which are optimized for contiguous
memory operations. As a result, the actual speedup of the sparse attention kernel falls far short of the
theoretical speedup implied by its sparsity.
To address this, we introduce a hardware-efficient layout transformation. This
technique rearranges the tensor layout into a token-wise order, ensuring that the tokens required by temporal
attention are stored contiguously in memory. With this transformation, the attention kernel is sped up by
1.7× and reaches its theoretical speedup ratio.
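A hedged sketch of the idea behind the layout transformation, assuming a frame-major sequence of shape [frames × tokens_per_frame, dim]: a transpose into token-major order makes the tokens a temporal head needs (the same spatial position across all frames) contiguous in memory. The real SVG implementation fuses this with customized kernels; this snippet only illustrates the reordering.

```python
import torch

def to_token_major(x: torch.Tensor, num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Rearrange a frame-major sequence [F*S, D] into token-major order [S*F, D],
    so the F tokens at each spatial position become adjacent in memory."""
    d = x.shape[-1]
    x = x.view(num_frames, tokens_per_frame, d)    # [F, S, D], frame-major
    x = x.transpose(0, 1).contiguous()             # [S, F, D], materialize token-major copy
    return x.view(tokens_per_frame * num_frames, d)

x = torch.randn(4 * 3, 8)                          # 4 frames x 3 tokens per frame, dim 8
y = to_token_major(x, num_frames=4, tokens_per_frame=3)
print(y.shape)                                     # torch.Size([12, 8])
```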
BibTeX
@article{xi2025sparse,
title={Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity},
author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Xu, Chenfeng and Li, Muyang and Li, Xiuyu and Lin, Yujun and Cai, Han and Zhang, Jintao and Li, Dacheng and others},
journal={arXiv preprint arXiv:2502.01776},
year={2025}
}