FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
Xuanhua He1,
Quande Liu2,†,
Zixuan Ye1,
Weicai Ye2,
Qiulin Wang2,
Xintao Wang2,
Qifeng Chen1,
Pengfei Wan2,
Di Zhang2,
Kun Gai2
1The Hong Kong University of Science and Technology
2Kuaishou Technology
†Corresponding author
Abstract
Fine-grained and efficient controllability for video diffusion transformers has drawn growing interest for practical applications. Recently, In-Context Conditioning has emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a single long token sequence and jointly processing them via full attention, e.g., FullDiT. Despite their effectiveness, these methods incur quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning framework for video generation. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. First, to address token redundancy in context conditions, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents throughout the diffusion process. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation, and in some cases improvement, in video generation quality.
FullDiT2 Showcase: Diverse Capabilities
Highlighting the visual quality and controllability of FullDiT2 across various video generation and editing tasks.
Showcasing: ID Insertion
FullDiT2 demonstrates high-fidelity insertion and can even outperform baselines in identity preservation for this task.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: ID Swap
FullDiT2 effectively swaps identities while maintaining scene coherence and video quality.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: ID Deletion
FullDiT2 cleanly removes specified subjects or objects with minimal artifacts.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: Video Re-Camera
Generates video from new camera perspectives based on a reference video and target camera trajectory, handling multiple dense conditions efficiently.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: Pose-to-Video
Creates realistic and temporally consistent video driven by pose sequences, accurately following pose guidance.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: Trajectory-to-Video
Generates dynamic video content following specified camera trajectories with good alignment.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Our Approach: FullDiT2
Traditional approaches to conditional video generation, such as adapter-based methods, often require introducing additional network structures for specific tasks, which limits flexibility. As shown in Figure 1, In-Context Conditioning (ICC), as exemplified by models like FullDiT, offers a more unified solution by concatenating condition tokens with noisy latents and processing them jointly, achieving diverse control capabilities. However, this token concatenation strategy, while effective, introduces a significant computational burden due to the quadratic complexity of full attention over these extended sequences. To address this challenge, we propose FullDiT2, an efficient ICC framework. FullDiT2 inherits the versatile context conditioning mechanism but introduces two key innovations to mitigate the computational overhead: 1) Dynamic Token Selection, which reduces the sequence length for full attention by identifying important context tokens, and 2) Selective Context Caching, which minimizes redundant computation by caching and skipping context tokens across diffusion steps and blocks. Our method thus realizes an efficient and effective ICC framework for controllable video generation and editing.
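For intuition, here is a minimal sketch of an ICC-style transformer block under simplified assumptions: condition tokens and noisy latents share one full-attention sequence, which is exactly where the quadratic cost arises. The class and variable names (ICCBlock, z, c) are illustrative and do not correspond to any released implementation.

```python
import torch
import torch.nn as nn


class ICCBlock(nn.Module):
    """Toy in-context conditioning block: one shared full-attention sequence."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # z: noisy video latents, shape (B, n_z, dim)
        # c: context condition tokens (camera, pose, reference video, ...), shape (B, n_c, dim)
        x = torch.cat([z, c], dim=1)       # unified sequence of length n_z + n_c
        x = x + self.attn(x, x, x)[0]      # full attention: cost scales as O((n_z + n_c)^2)
        x = x + self.ffn(x)
        return x
```

When the conditions include dense signals such as a full reference video, n_c can be comparable to n_z, so the joint sequence is several times longer than the latents alone; this is the overhead FullDiT2 targets.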
1. Dynamic Token Selection
To address token redundancy, where many context tokens may carry little information, each Transformer block in FullDiT2 adaptively selects an informative subset of reference tokens (e.g., the top 50% in our implementation) using a lightweight, learnable importance prediction network that operates on the reference Value vectors. This shortens the sequence over which attention involving reference tokens is computed, lowering the cost from $O((n_z+n_c)^2)$ towards $O((n_z+k)^2)$. Unselected reference tokens bypass the attention mechanism and are re-concatenated after the Feed-Forward Network, preserving their information for subsequent layers.
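The sketch below illustrates this selection step under simplified assumptions: a plain top-k over predicted scores, with training details of the importance predictor omitted. Names such as DynamicTokenSelectBlock and keep_ratio are hypothetical.

```python
import torch
import torch.nn as nn


class DynamicTokenSelectBlock(nn.Module):
    """Toy block: keep only the top-k reference tokens for full attention."""

    def __init__(self, dim: int, num_heads: int = 8, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.to_v = nn.Linear(dim, dim)       # Value projection for reference tokens
        self.importance = nn.Linear(dim, 1)   # lightweight learnable importance predictor
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor, c: torch.Tensor):
        # z: noisy latents (B, n_z, dim); c: reference/context tokens (B, n_c, dim)
        B, n_c, dim = c.shape
        k = max(1, int(self.keep_ratio * n_c))

        # Score reference tokens from their Value vectors and keep the top-k.
        scores = self.importance(self.to_v(c)).squeeze(-1)        # (B, n_c)
        top_idx = scores.topk(k, dim=1).indices                   # (B, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, dim)
        c_sel = torch.gather(c, 1, gather_idx)                    # (B, k, dim)

        # Attention now runs over n_z + k tokens instead of n_z + n_c.
        x = torch.cat([z, c_sel], dim=1)
        x = x + self.attn(x, x, x)[0]
        x = x + self.ffn(x)
        z_out, c_sel_out = x[:, : z.shape[1]], x[:, z.shape[1]:]

        # Unselected reference tokens bypass attention/FFN entirely; scatter the
        # updated selected tokens back so later layers see a full reference sequence.
        c_out = c.clone().scatter_(1, gather_idx, c_sel_out)
        return z_out, c_out
```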
2. Selective Context Caching
To tackle computation redundancy across timesteps and layers, FullDiT2 first identifies important layers for reference token processing using a Block Importance Index (BI). Only these pre-selected important layers (e.g., the 4 layers with the highest BI, plus the first layer for token projection, in our model) process reference information; intermediate layers only process noisy tokens, with reference representations passed directly between important layers. For temporal efficiency, and because context tokens are relatively static across diffusion steps compared to noisy latents, we cache the Key (K) and Value (V) of the selected top-k reference tokens from the first sampling step ($T_0$). These cached K/V values are then reused in subsequent steps for the non-skipped layers, avoiding redundant re-computation. Decoupled attention is employed to maintain training-inference consistency during this caching process, as naive caching can lead to misalignment.
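The following sketch shows the inference-time caching idea for one retained important layer, assuming a simplified decoupled attention in which noisy queries attend to both noisy and cached reference K/V. The class name CachedContextAttention and the first_step flag are illustrative, and the Block Importance selection itself is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CachedContextAttention(nn.Module):
    """Toy inference-time attention with cached reference K/V across diffusion steps."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.ref_k = None   # cached Key of selected reference tokens
        self.ref_v = None   # cached Value of selected reference tokens

    def _heads(self, t: torch.Tensor) -> torch.Tensor:
        B, N, _ = t.shape
        return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, z: torch.Tensor, c_sel: torch.Tensor, first_step: bool) -> torch.Tensor:
        # z: noisy latents (B, n_z, dim); c_sel: selected top-k reference tokens (B, k, dim)
        if first_step or self.ref_k is None:
            # Reference tokens barely change across diffusion steps, so compute their
            # K/V once at the first sampling step (T_0) and reuse them afterwards.
            self.ref_k = self._heads(self.k_proj(c_sel))
            self.ref_v = self._heads(self.v_proj(c_sel))

        q = self._heads(self.q_proj(z))
        k = torch.cat([self._heads(self.k_proj(z)), self.ref_k], dim=2)
        v = torch.cat([self._heads(self.v_proj(z)), self.ref_v], dim=2)

        # Noisy queries attend to noisy + cached reference K/V; the reference stream
        # itself is not re-processed after the first step.
        out = F.scaled_dot_product_attention(q, k, v)       # (B, H, n_z, head_dim)
        B, _, n_z, _ = out.shape
        return self.out_proj(out.transpose(1, 2).reshape(B, n_z, -1))
```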
Comparison of our FullDiT2 with adapter-based methods and FullDiT.
FullDiT2 Framework Overview
Comparisons
Task-Specific Comparisons against Baseline (FullDiT)
FullDiT2 consistently achieves significant speedups while maintaining or improving generation quality compared to the FullDiT baseline across various tasks.
ID Insertion
Quantitative Highlights
- Speedup (Ours): 2.287x
- GFLOPS (Baseline vs Ours): 69.292 vs 33.141 (Ours is lower)
- CLIP-I (Baseline vs Ours): 0.568 vs 0.605 (Ours is higher)
- DINO-S (Baseline vs Ours): 0.254 vs 0.313 (Ours is higher)
- FullDiT2 can even outperform the baseline in ID insertion tasks.
Case 1
Case 2
ID Swap
Quantitative Highlights
- Speedup (Ours): 2.287x
- GFLOPS (Baseline vs Ours): 69.292 vs 33.141
- CLIP-I (Baseline vs Ours): 0.619 vs 0.621
Case 1
Case 2
ID Deletion
Quantitative Highlights
- Speedup (Ours): 2.287x
- GFLOPS (Baseline vs Ours): 69.292 vs 33.141
Case 1
Case 2
Video Re-Camera
Quantitative Highlights
- Speedup (Ours): 3.433x
- GFLOPS (Baseline vs Ours): 101.517 vs 33.407 (~32% of baseline)
- RotErr / TransErr: Comparable or improved (e.g., TransErr: Baseline 6.173 vs Ours 5.730)
Case 1
Case 2
Pose-to-Video
Quantitative Highlights
- Speedup (Ours): 2.143x
- GFLOPS (Baseline vs Ours): 64.457 vs 33.111
- PCK (Pose Control): Maintained (e.g. Baseline 72.445 vs Ours 71.408)
Case 1
Case 2
Trajectory-to-Video
Quantitative Highlights
- Speedup (Ours): 2.143x
- GFLOPS (Baseline vs Ours): 64.457 vs 33.111
- RotErr / TransErr: Maintained (e.g. Baseline 1.471 / 5.755 vs Ours 1.566 / 5.714)
Case 1
Case 2
Task-based Comparison with Acceleration Techniques
Comparing FullDiT2 with other acceleration methods (Delta-DiT, FORA) on specific tasks, focusing on output quality and conditioning adherence under comparable speedup settings.
ID Insert
ID Swap
ID Delete
Video Recamera
Trajectory-to-Video
Pose to Video
Efficiency and Performance Gains Summary
FullDiT2 demonstrates substantial improvements in computational efficiency while maintaining or even enhancing video generation quality across six diverse tasks.
- Significant Speedup: Achieves a 2-3x speedup in average time cost per diffusion step compared to the FullDiT baseline. For instance, on ID-related video editing tasks, FullDiT2 achieves an approximately 2.28x speedup.
- Reduced Computational Cost: The reduction is particularly pronounced in tasks with multiple conditions, such as Video Re-Camera, where FullDiT2 cuts computation to only 32% of baseline FLOPs and achieves a 3.43x speedup.
- Preserved/Improved Quality: Maintains high fidelity and accurately adheres to diverse conditioning inputs, achieving results comparable to, or in some cases better than, the baseline; for example, FullDiT2 outperforms the baseline on the ID insertion task.