FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
Xuanhua He1,
Quande Liu2,†,
Zixuan Ye1,
Weicai Ye2,
Qiulin Wang2,
Xintao Wang2,
Qifeng Chen1,
Pengfei Wan2,
Di Zhang2,
Kun Gai2
1The Hong Kong University of Science and Technology
2Kuaishou Technology
†Corresponding author
Abstract
Fine-grained and efficient controllability for video diffusion transformers has drawn growing interest for practical applications. Recently, In-Context Conditioning has emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a single long token sequence and jointly processing them via full attention, e.g., FullDiT. Despite their effectiveness, these methods incur quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning framework for video generation. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. First, to address token redundancy in context conditions, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents throughout the diffusion process. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation, and in some cases improvement, in video generation quality.
FullDiT2 Showcase: Diverse Capabilities
Highlighting the visual quality and controllability of FullDiT2 across various video generation and editing tasks.
Showcasing: ID Insertion
FullDiT2 demonstrates high-fidelity insertion and can even outperform baselines in identity preservation for this task.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: ID Swap
FullDiT2 effectively swaps identities while maintaining scene coherence and video quality.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: ID Deletion
FullDiT2 cleanly removes specified subjects or objects with minimal artifacts.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: Video Re-Camera
Generates video from new camera perspectives based on a reference video and target camera trajectory, handling multiple dense conditions efficiently.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: Pose-to-Video
Creates realistic and temporally consistent video driven by pose sequences, accurately following pose guidance.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Showcasing: Trajectory-to-Video
Generates dynamic video content following specified camera trajectories with good alignment.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Our Approach: FullDiT2
Traditional approaches to conditional video generation, such as adapter-based methods, often require introducing additional network structures for specific tasks, which limits flexibility. As shown in Figure 1, In-Context Conditioning (ICC), as exemplified by models like FullDiT, offers a more unified solution by concatenating condition tokens with noisy latents and processing them jointly, achieving diverse control capabilities. However, this token concatenation strategy, while effective, introduces a significant computational burden due to the quadratic complexity of full attention over these extended sequences. To address this challenge, we propose FullDiT2, an efficient ICC framework. FullDiT2 inherits the versatile context conditioning mechanism but introduces two key innovations to mitigate the computational overhead: 1) Dynamic Token Selection, which reduces the sequence length for full attention by identifying important context tokens, and 2) Selective Context Caching, which minimizes redundant computation by caching and skipping context tokens across diffusion steps and blocks. Our method thus realizes an efficient and effective ICC framework for controllable video generation and editing.
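For intuition, here is a minimal sketch of an ICC-style transformer block under simplified assumptions: condition tokens and noisy latents share one full-attention sequence, which is exactly where the quadratic cost arises. The class and variable names (ICCBlock, z, c) are illustrative and do not correspond to any released implementation.

```python
import torch
import torch.nn as nn


class ICCBlock(nn.Module):
    """Toy in-context conditioning block: one shared full-attention sequence."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # z: noisy video latents, shape (B, n_z, dim)
        # c: context condition tokens (camera, pose, reference video, ...), shape (B, n_c, dim)
        x = torch.cat([z, c], dim=1)       # unified sequence of length n_z + n_c
        x = x + self.attn(x, x, x)[0]      # full attention: cost scales as O((n_z + n_c)^2)
        x = x + self.ffn(x)
        return x
```

When the conditions include dense signals such as a full reference video, n_c can be comparable to n_z, so the joint sequence is several times longer than the latents alone; this is the overhead FullDiT2 targets.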
1. Dynamic Token Selection
To address token redundancy, where many context tokens may carry little information, each Transformer block in FullDiT2 adaptively selects an informative subset of reference tokens (e.g., the top 50% in our implementation) using a lightweight, learnable importance prediction network that operates on the reference Value vectors. This shortens the sequence over which attention involving reference tokens is computed, lowering the cost from $O((n_z+n_c)^2)$ towards $O((n_z+k)^2)$. Unselected reference tokens bypass the attention mechanism and are re-concatenated after the Feed-Forward Network, preserving their information for subsequent layers.
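The sketch below illustrates this selection step under simplified assumptions: a plain top-k over predicted scores, with training details of the importance predictor omitted. Names such as DynamicTokenSelectBlock and keep_ratio are hypothetical.

```python
import torch
import torch.nn as nn


class DynamicTokenSelectBlock(nn.Module):
    """Toy block: keep only the top-k reference tokens for full attention."""

    def __init__(self, dim: int, num_heads: int = 8, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.to_v = nn.Linear(dim, dim)       # Value projection for reference tokens
        self.importance = nn.Linear(dim, 1)   # lightweight learnable importance predictor
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor, c: torch.Tensor):
        # z: noisy latents (B, n_z, dim); c: reference/context tokens (B, n_c, dim)
        B, n_c, dim = c.shape
        k = max(1, int(self.keep_ratio * n_c))

        # Score reference tokens from their Value vectors and keep the top-k.
        scores = self.importance(self.to_v(c)).squeeze(-1)        # (B, n_c)
        top_idx = scores.topk(k, dim=1).indices                   # (B, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, dim)
        c_sel = torch.gather(c, 1, gather_idx)                    # (B, k, dim)

        # Attention now runs over n_z + k tokens instead of n_z + n_c.
        x = torch.cat([z, c_sel], dim=1)
        x = x + self.attn(x, x, x)[0]
        x = x + self.ffn(x)
        z_out, c_sel_out = x[:, : z.shape[1]], x[:, z.shape[1]:]

        # Unselected reference tokens bypass attention/FFN entirely; scatter the
        # updated selected tokens back so later layers see a full reference sequence.
        c_out = c.clone().scatter_(1, gather_idx, c_sel_out)
        return z_out, c_out
```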
2. Selective Context Caching
To tackle computation redundancy across timesteps and layers, FullDiT2 first identifies important layers for reference token processing using a Block Importance Index (BI). Only these pre-selected important layers (e.g., the 4 layers with the highest BI, plus the first layer for token projection, in our model) process reference information; intermediate layers only process noisy tokens, with reference representations passed directly between important layers. For temporal efficiency, and because context tokens are relatively static across diffusion steps compared to noisy latents, we cache the Key (K) and Value (V) of the selected top-k reference tokens from the first sampling step ($T_0$). These cached K/V values are then reused in subsequent steps for the non-skipped layers, avoiding redundant re-computation. Decoupled attention is employed to maintain training-inference consistency during this caching process, as naive caching can lead to misalignment.
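The following sketch shows the inference-time caching idea for one retained important layer, assuming a simplified decoupled attention in which noisy queries attend to both noisy and cached reference K/V. The class name CachedContextAttention and the first_step flag are illustrative, and the Block Importance selection itself is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CachedContextAttention(nn.Module):
    """Toy inference-time attention with cached reference K/V across diffusion steps."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.ref_k = None   # cached Key of selected reference tokens
        self.ref_v = None   # cached Value of selected reference tokens

    def _heads(self, t: torch.Tensor) -> torch.Tensor:
        B, N, _ = t.shape
        return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, z: torch.Tensor, c_sel: torch.Tensor, first_step: bool) -> torch.Tensor:
        # z: noisy latents (B, n_z, dim); c_sel: selected top-k reference tokens (B, k, dim)
        if first_step or self.ref_k is None:
            # Reference tokens barely change across diffusion steps, so compute their
            # K/V once at the first sampling step (T_0) and reuse them afterwards.
            self.ref_k = self._heads(self.k_proj(c_sel))
            self.ref_v = self._heads(self.v_proj(c_sel))

        q = self._heads(self.q_proj(z))
        k = torch.cat([self._heads(self.k_proj(z)), self.ref_k], dim=2)
        v = torch.cat([self._heads(self.v_proj(z)), self.ref_v], dim=2)

        # Noisy queries attend to noisy + cached reference K/V; the reference stream
        # itself is not re-processed after the first step.
        out = F.scaled_dot_product_attention(q, k, v)       # (B, H, n_z, head_dim)
        B, _, n_z, _ = out.shape
        return self.out_proj(out.transpose(1, 2).reshape(B, n_z, -1))
```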
Comparison of our FullDiT2 with adapter-based methods and FullDiT.
FullDiT2 Framework Overview
Comparisons
Task-Specific Comparisons against Baseline (FullDiT)
FullDiT2 consistently achieves significant speedups while maintaining or improving generation quality compared to the FullDiT baseline across various tasks.
ID Insertion
Quantitative Highlights
- Speedup (Ours): 2.287x
- GFLOPS (Baseline vs Ours): 69.292 vs 33.141 (Ours is lower)
- CLIP-I (Baseline vs Ours): 0.568 vs 0.605 (Ours is higher)
- DINO-S (Baseline vs Ours): 0.254 vs 0.313 (Ours is higher)
- FullDiT2 can even outperform the baseline in ID insertion tasks.
Case 1
Case 2
ID Swap
Quantitative Highlights
- Speedup (Ours): 2.287x
- GFLOPS (Baseline vs Ours): 69.292 vs 33.141
- CLIP-I (Baseline vs Ours): 0.619 vs 0.621
Case 1
Case 2
ID Deletion
Quantitative Highlights
- Speedup (Ours): 2.287x
- GFLOPS (Baseline vs Ours): 69.292 vs 33.141
Case 1
Case 2
Video Re-Camera
Quantitative Highlights
- Speedup (Ours): 3.433x
- GFLOPS (Baseline vs Ours): 101.517 vs 33.407 (~32% of baseline)
- RotErr / TransErr: Comparable or improved (e.g., TransErr: Baseline 6.173 vs Ours 5.730)
Case 1
Case 2
Pose-to-Video
Quantitative Highlights
- Speedup (Ours): 2.143x
- GFLOPS (Baseline vs Ours): 64.457 vs 33.111
- PCK (Pose Control): Maintained (e.g. Baseline 72.445 vs Ours 71.408)
Case 1
Case 2
Trajectory-to-Video
Quantitative Highlights
- Speedup (Ours): 2.143x
- GFLOPS (Baseline vs Ours): 64.457 vs 33.111
- RotErr / TransErr: Maintained (e.g. Baseline 1.471 / 5.755 vs Ours 1.566 / 5.714)
Case 1
Case 2
Task-based Comparison with Acceleration Techniques
Comparing FullDiT2 with other acceleration methods (Delta-DiT, FORA) on specific tasks, focusing on output quality and conditioning adherence under comparable speedup settings.
ID Insert
ID Swap
ID Delete
Video Recamera
Trajectory-to-Video
Pose to Video
Efficiency and Performance Gains Summary
FullDiT2 demonstrates substantial improvements in computational efficiency while maintaining or even enhancing video generation quality across six diverse tasks.
- Significant Speedup: Achieves a 2-3x speedup in average time cost per diffusion step compared to the FullDiT baseline. For instance, on ID-related video editing tasks, FullDiT2 achieves an approximately 2.28x speedup.
- Reduced Computational Cost: The reduction is particularly pronounced in tasks with multiple conditions, such as Video Re-Camera, where FullDiT2 cuts computation to only 32% of baseline FLOPs and achieves a 3.43x speedup.
- Preserved/Improved Quality: Maintains high fidelity and accurately adheres to diverse conditioning inputs, achieving results comparable to, or in some cases better than, the baseline; for example, FullDiT2 outperforms the baseline on the ID insertion task.