Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
*Indicates Equal Contribution
Accepted by NeurIPS 2025 as a Spotlight Paper
Gallery - All videos are generated by SVG2!
TL;DR
Sparse VideoGen2 is a training-free framework that accelerates video diffusion model inference using an innovative semantic-aware permutation technique and efficient dynamic attention kernels. It offers Pareto-frontier performance, combining high fidelity with high generation speed.
Overview
Through a co-design of algorithms and systems, Sparse VideoGen2 provides an end-to-end solution that achieves measurable speedups on real-world hardware, offering greater efficiency and lower costs for video generation tasks.
Limitations of Current Sparse Attention Algorithms
While current video generation models are powerful, their 3D Full Attention mechanism is extremely slow and requires substantial computational resources. Various sparse attention methods have been proposed to mitigate the high cost of full attention, but they fall short due to two critical drawbacks.
First, these methods suffer from inaccurate identification. Existing sparse methods often rely on predefined, static patterns (e.g., local windows or strided attention) to select tokens. Consequently, the aggregated activations become less representative, making the selection of critical tokens inaccurate and degrading the video quality.
Second, these methods lead to significant computation waste. Even when they reduce the total number of calculations, the scattered critical tokens introduce irregular memory access patterns. Modern hardware such as GPUs cannot achieve peak performance when accessing data from non-contiguous blocks of memory.
Efficient Sparse Attention via Semantic-Aware Permutation
We propose to rearrange the tokens so that semantically similar tokens sit closer together in global memory. We achieve this with a lightweight Semantic-Aware Permutation strategy. This step happens on the fly for each timestep and layer, and it is crucial for the efficiency of the subsequent clustering and attention approximation stages. We apply different permutations to the query and key / value tokens to further improve the accuracy of the permutation.
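Below is a minimal PyTorch sketch of the permutation step, assuming per-token cluster labels are already available (e.g., from the k-means step described next). The name permute_by_cluster is illustrative, not the released API.

import torch

def permute_by_cluster(tokens: torch.Tensor, labels: torch.Tensor):
    # tokens: [N, d] query (or key / value) tokens; labels: [N] cluster id per token.
    # Sorting by cluster id places semantically similar tokens contiguously in memory.
    perm = torch.argsort(labels)
    inv_perm = torch.argsort(perm)  # used later to restore the original token order
    return tokens[perm].contiguous(), perm, inv_perm

Queries and keys / values are permuted independently, so each side is grouped in its own semantic space.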
Identify semantically similar tokens using k-means clustering
We apply an efficient k-means clustering algorithm to measure the semantic similarity of the tokens and group them into clusters, where each cluster represents a set of semantically similar tokens and its centroid serves as the cluster's representative. Using more clusters groups the tokens in a more fine-grained manner, but it increases the overhead of running k-means and reduces the attention kernel's performance; using fewer clusters has the opposite effect. We find that 100~200 clusters for query tokens and 500~1000 clusters for key / value tokens offer a good trade-off. Efficiency results can be found below.
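For illustration, a plain Lloyd-style k-means sketch in PyTorch is shown below; the actual implementation is more heavily optimized, and the initialization and iteration count here are assumptions.

import torch

def kmeans(tokens: torch.Tensor, num_clusters: int, iters: int = 10):
    # tokens: [N, d]; returns labels [N] and centroids [num_clusters, d].
    idx = torch.randperm(tokens.shape[0], device=tokens.device)[:num_clusters]
    centroids = tokens[idx].clone()  # initialize centroids from randomly chosen tokens
    for _ in range(iters):
        # Assign each token to its nearest centroid.
        labels = torch.cdist(tokens, centroids).argmin(dim=-1)
        # Recompute each centroid as the mean of its members.
        for c in range(num_clusters):
            members = tokens[labels == c]
            if members.numel() > 0:  # keep the old centroid if a cluster becomes empty
                centroids[c] = members.mean(dim=0)
    return labels, centroids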
Approximate the attention map through Centroid-based Top-P estimation
With tokens grouped into clusters, we next decide the sparse attention pattern. Since centroids represent their clusters, we can approximate the criticality of attention weights from them: exact attention is computed only between query centroids and key / value centroids, serving as a proxy for the full attention map. We then use Centroid-based Top-P Estimation to identify the most important weights. Top-P attention preserves quality while adapting the attention budget. Exact attention is then computed only for the important weights, significantly reducing computation.
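A sketch of how such a centroid-level Top-P estimation could look is given below. The cluster-size reweighting and the single threshold p are assumptions for illustration, not necessarily the exact scheme used in the paper.

import torch

def topp_cluster_mask(q_centroids, k_centroids, k_cluster_sizes, p=0.9):
    # q_centroids: [Cq, d], k_centroids: [Ck, d], k_cluster_sizes: [Ck].
    # Returns a boolean [Cq, Ck] mask of (query-cluster, key-cluster) blocks to compute exactly.
    d = q_centroids.shape[-1]
    # Centroid-level attention as a proxy for the full map; weighting by cluster
    # size accounts for how many key tokens each centroid represents.
    logits = (q_centroids @ k_centroids.T) * d ** -0.5
    probs = torch.softmax(logits, dim=-1) * k_cluster_sizes.float()
    probs = probs / probs.sum(dim=-1, keepdim=True)
    # Keep the smallest set of key clusters whose cumulative mass reaches p.
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    keep_sorted = cum - sorted_probs < p  # always keeps at least the top block
    mask = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, order, keep_sorted)
    return mask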
System Optimizations for Semantic-Aware Permutation
To further improve the performance of Semantic-Aware Permutation, we propose a Centroids Cache that reduces the overhead of k-means by exploiting redundancy between timesteps. Compared with naively re-running k-means at every timestep, the Centroids Cache offers a 76× speedup. By introducing semantic-aware permutation, we successfully reach ideal hardware efficiency on GPUs: compared with the scattered attention pattern without permutation, our dynamic attention kernel offers a 1.7×~1.8× speedup.
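The sketch below shows one way such a cache could warm-start k-means from the previous timestep's centroids; the refinement schedule (a single pass on a cache hit) is an assumption, not the paper's exact policy.

import torch

class CentroidsCache:
    # Activations change slowly between adjacent diffusion timesteps, so the
    # previous step's centroids are a strong initialization for the current step.
    def __init__(self):
        self.cached = {}  # (layer, head) -> centroids tensor [C, d]

    def cluster(self, key, tokens, num_clusters):
        centroids = self.cached.get(key)
        if centroids is None:
            # Cold start: random tokens as initial centroids, more iterations.
            idx = torch.randperm(tokens.shape[0], device=tokens.device)[:num_clusters]
            centroids, iters = tokens[idx].clone(), 10
        else:
            iters = 1  # warm start: one assignment/update pass (assumed schedule)
        for _ in range(iters):
            labels = torch.cdist(tokens, centroids).argmin(dim=-1)
            for c in range(num_clusters):
                members = tokens[labels == c]
                if members.numel() > 0:
                    centroids[c] = members.mean(dim=0)
        self.cached[key] = centroids.detach()
        return labels, centroids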
Efficient Dynamic Block Size Attention kernels
Standard block-sparse attention kernels are optimized for fixed block sizes, which is inefficient for our clustered approach, where each cluster can contain a different number of tokens. To address this, we developed Efficient Dynamic Block Size Attention Kernels supporting both the FlashAttention-2 and FlashAttention-3 algorithms.
These custom CUDA kernels are designed to handle variable-sized
blocks of data, allowing for highly efficient computation on the sparse, clustered attention map. This
hardware-level optimization ensures that our theoretical gains from approximation translate into
real-world speedups.
Our kernel has a very loose dependency on the cluster size of key / value tokens, making it highly efficient even for small cluster sizes and enabling a large number of clusters. For query tokens, we find that a larger block size (i.e., a smaller number of blocks) is important for achieving high TFLOPS. Ablations on the number of clusters can be found below.
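To make the kernel's semantics concrete, here is a reference (non-fused) PyTorch emulation of attention over variable-sized blocks; it only illustrates what the fused CUDA kernel computes, not how it achieves the speedup, and the function name and shapes are assumptions.

import torch
import torch.nn.functional as F

def dynamic_block_sparse_attention(q, k, v, q_labels, k_labels, block_mask):
    # q, k, v: [N, d] permuted tokens; q_labels / k_labels: [N] cluster ids;
    # block_mask: [Cq, Ck] boolean map from centroid-based Top-P estimation.
    out = torch.zeros_like(q)
    for qc in range(block_mask.shape[0]):
        q_idx = (q_labels == qc).nonzero(as_tuple=True)[0]
        if q_idx.numel() == 0:
            continue
        # Gather all key / value tokens belonging to the clusters kept for this query block.
        k_idx = block_mask[qc][k_labels].nonzero(as_tuple=True)[0]
        if k_idx.numel() == 0:
            continue
        out[q_idx] = F.scaled_dot_product_attention(
            q[q_idx].unsqueeze(0), k[k_idx].unsqueeze(0), v[k_idx].unsqueeze(0)
        ).squeeze(0)
    return out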
Quantitative Evaluation
We evaluate the performance of Sparse VideoGen2 on Wan 2.1 and HunyuanVideo.
BibTeX
@article{yang2025sparse,
title={Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation},
author={Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Li, Muyang and Zhang, Jintao and Cai, Han and Lin, Yujun and Li, Xiuyu and Xu, Chenfeng and Peng, Kelly and others},
journal={arXiv preprint arXiv:2505.18875},
year={2025}
}