CTRL-Adapter:
An Efficient and Versatile Framework
for Adapting Diverse Controls to Any Diffusion Model
Abstract
ControlNets are widely used for adding spatial control to text-to-image diffusion models with different conditions, such as depth maps, scribbles/sketches, and human poses. However, when it comes to controllable video generation, ControlNets cannot be directly integrated into new backbones due to feature space mismatches, and training ControlNets for new backbones can be a significant burden for many users. Furthermore, applying ControlNets independently to different frames cannot effectively maintain object temporal consistency.
To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion model through the adaptation of pretrained ControlNets. Ctrl-Adapter offers strong and diverse capabilities, including image and video control, sparse-frame video control, fine-grained patch-level multi-condition control (via an MoE router), and zero-shot adaptation to unseen conditions. It also supports a variety of downstream tasks beyond spatial control, including video editing, video style transfer, and text-guided motion control. With six diverse U-Net/DiT-based image/video diffusion models (SDXL, PixArt-α, I2VGen-XL, SVD, Latte, Hotshot-XL), Ctrl-Adapter matches the performance of pretrained ControlNets on COCO and achieves state-of-the-art performance on DAVIS 2017 with significantly lower computation (< 10 GPU hours).
Method (see more details below ↓)
Efficient Adaptation of Pretrained ControlNets. As shown in the left figure, we train an adapter module (colored orange) to map the middle/output blocks of a pretrained ControlNet (colored blue) to the corresponding middle/output blocks of the target video diffusion model (colored green). We keep all parameters in both the ControlNet and the target video diffusion model frozen. Therefore, training a Ctrl-Adapter can be significantly more efficient than training a new video ControlNet.
Ctrl-Adapter architecture. As shown in the right figure, each block of Ctrl-Adapter consists of four modules: spatial convolution, temporal convolution, spatial attention, and temporal attention. The temporal convolution and attention modules effectively fuse the ControlNet features for better temporal consistency. When adapting to image diffusion models, Ctrl-Adapter blocks consist only of spatial convolution/attention modules (without temporal convolution/attention modules).
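The split between spatial and temporal modules can be illustrated with a small NumPy sketch. This is not the released implementation: it assumes a common factorized layout in which spatial attention mixes the H×W positions within each frame and temporal attention mixes the frames at each position. The convolution modules and all learned projections are omitted, and the residual connections and the names `self_attention`, `spatial_attention`, `temporal_attention`, and `adapter_block` are hypothetical.

```python
import numpy as np

def self_attention(tokens):
    """Single-head self-attention without learned projections (illustration only).

    tokens: array of shape (batch, n_tokens, channels).
    """
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    scores = tokens @ tokens.transpose(0, 2, 1) * scale  # (batch, n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def spatial_attention(x):
    """Attend over the H*W positions inside each frame. x: (frames, H, W, C)."""
    f, h, w, c = x.shape
    out = self_attention(x.reshape(f, h * w, c))
    return out.reshape(f, h, w, c)

def temporal_attention(x):
    """Attend over the frames at each spatial position. x: (frames, H, W, C)."""
    f, h, w, c = x.shape
    tokens = x.transpose(1, 2, 0, 3).reshape(h * w, f, c)
    out = self_attention(tokens)
    return out.reshape(h, w, f, c).transpose(2, 0, 1, 3)

def adapter_block(x, use_temporal=True):
    """Hypothetical block: spatial module, then an optional temporal module."""
    x = x + spatial_attention(x)       # assumed residual connection
    if use_temporal:                   # skipped when adapting to image backbones
        x = x + temporal_attention(x)  # assumed residual connection
    return x

rng = np.random.default_rng(0)
controlnet_feat = rng.normal(size=(8, 4, 4, 16))  # 8 frames of 4x4 features, 16 channels
video_out = adapter_block(controlnet_feat, use_temporal=True)
image_out = adapter_block(controlnet_feat[:1], use_temporal=False)
```

In this sketch only the temporal module lets information flow between frames, which is the property that supports temporal consistency; the image variant simply drops that step.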
Generated Examples
We show examples from both U-Net-based models (I2VGen-XL and SDXL) and DiT-based models (Latte and PixArt-α).
Video Generation with Condition Control (w/ I2VGen-XL; U-Net based)
"A fish swimming"
+
| Control | Generated Video | |
|---|---|---|
|
|
|
+
| Control | Generated Video | |
|---|---|---|
|
|
|
with pearlescent, silver-edged scales,
icy blue eyes, elegant ivory horns, and
misty breath. Focus on detailed facial
features and textured scales, set
against a softly blurred background"
+
| Control | Generated Video | |
|---|---|---|
|
|
|
"A car flies over a hill"
+
| Control | Generated Video | |
|---|---|---|
|
|
|
is seen happily darting through a
dense garden, as if chasing something.
Its eyes are wide and happy
as it jogs forward, scanning the
branches, flowers, and leaves as it
walks. The path is narrow as
it makes its way between all
the plants."
+
| Control | Generated Video | |
|---|---|---|
|
|
|
"A bird flying over a forest."
+
| Control | Generated Video | |
|---|---|---|
|
|
|
snow-covered houses, glowing
windows, decorated trees, festive
snowmen, and tiny figurines in a
quaint, holiday-themed diorama
evoking a cozy, celebratory winter atmosphere"
+
| Control | Generated Video | |
|---|---|---|
|
|
|
"A woman wearing blue jeans and a
white t-shirt taking a pleasant stroll in
Mumbai India during a beautiful sunset"
+
| Control | Generated Video | |
|---|---|---|
|
|
|
Video Generation with Condition Control (w/ Latte; DiT based)
"A 2d abstract japanese animation where drops of ink in water form into lifelike creatures that swim and interact with each other, creating an ethereal underwater world made entirely of flowing, merging colors"
| Control | Generated Video | |
|---|---|---|
|
|
|
"A giant, towering cloud in the shape of a man looms over the earth. The cloud man shoots lighting bolts down to the earth"
| Control | Generated Video | |
|---|---|---|
|
|
|
"A medium sized friendly looking dog walks through an industrial parking lot. The environment is foggy and cloudy. Shot on 35mm film, vivid colors."
| Control | Generated Video | |
|---|---|---|
|
|
|
Video Generation with Multiple Control Conditions
in shallow ocean waters along the beach"
+
| Controls | Generated Video | |
|---|---|---|
|
|
|
"A man dancing"
+
| Controls | Generated Video | |
|---|---|---|
|
|
|
taking a pleasant stroll in Johannesburg South Africa
during a beautiful sunset"
+
| Controls | Generated Video | |
|---|---|---|
|
|
|
a bench, wears a casual outfit and a beanie,
displaying focus and athletic skill"
+
| Controls | Generated Video | |
|---|---|---|
|
|
|
Video Editing via Combining Image and Video Ctrl-Adapters
| (1) Control Condition Extraction | Input Prompt | (2) Generated Frame (Generated by SDXL + Ctrl-Adapter) | (3) Generated Video (Generated by I2VGen-XL + Ctrl-Adapter) |
|---|---|---|---|
| | A camel with rainbow fur walking. | | |
| | A zebra stripped camel walking. | | |
| | A camel walking, ink sketch style. | | |
| | A camel walking, van gogh-style. | | |
Text-Guided Motion Control
| Initial Frame | Object Masking | Input Prompt | Generated Video (Generated by I2VGen-XL + Ctrl-Adapter) |
|---|---|---|---|
| | | A white and orange tabby alley cat is seen darting across a back street alley in a heavy rain, looking for shelter. | |
| | | A white and orange tabby cat is darting through a dense garden, as if chasing something | |
| | | An elk with impressive antlers grazing on the snow-covered ground | |
Video Style Transfer
| Initial Frame | Shuffled | Input Prompt | Generated Video (Generated by I2VGen-XL + Ctrl-Adapter) |
|---|---|---|---|
| | | A miniature Christmas village with snow-covered houses, glowing windows, decorated trees, festive snowmen, and tiny figurines in a quaint, holiday-themed diorama evoking a cozy, celebratory winter atmosphere | |
| | | Stop motion of a colorful paper flower blooming | |
| | | Beautiful, snowy Tokyo city is bustling | |
Video Generation with Sparse Frames as Control Condition
and sculptures and beautiful works of art in all styles"
+
| Sparse Inputs (Condition is given for 4 out of 16 frames) |
Generated Video | |
|---|---|---|
...
|
|
|
+
| Sparse Inputs (Condition is given for 4 out of 16 frames) |
Generated Video | |
|---|---|---|
...
|
|
|
Zero-Shot Generalization on Unseen Conditions
cowboy boots taking a pleasant stroll in
Mumbai India during a beautiful sunset"
+
| Condition | Controls | Generated Video | |
|---|---|---|---|
|
Training: Depth Map; Inference: Normal Map |
|
|
|
pondering the history of the universe. He sits at a cafe in Paris, his eyes focus on people offscreen. As they walk, he sits mostly motionless, he is dressed in a wool coat suit coat.
With a button-down shirt, he wears a brown beret and glasses."
+
| Condition | Controls | Generated Video | |
|---|---|---|---|
|
Training: Depth Map; Inference: Line art |
|
|
|
The background is blurred, drawing attention to the animal's striking appearance.
The chameleon's vibrant colors and unique texture are the focus of this shot."
+
| Condition | Controls | Generated Video | |
|---|---|---|---|
|
Training: Depth Map; Inference: Softedge |
|
|
|
Image Generation with Condition Control (w/ SDXL; U-Net based)
| Prompt | Control | Generated Image |
|---|---|---|
| Cute fluffy corgi dog in the city in anime style | | |
| happy Hulk standing in a beautiful field of flowers, colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, Kodak Portra 400, film grain | | |
| Astronaut walking on water | | |
| a cute mouse pilot wearing aviator goggles, unreal engine render, 8k | | |
| Cute lady frog in dress and crown dressed in gown in cinematic environment | | |
| A cute sheep with rainbow fur, photo | | |
| Cute and super adorable mouse in black and red chef coat and chef hat, holding a steaming entree | | |
| a cute, happy hedgehog taking a bite from a piece of watermelon, eyes closed, cute ink sketch style illustration | | |
Image Generation with Condition Control (w/ PixArt-α; DiT based)
| Prompt | Control | Generated Image |
|---|---|---|
| A plate of cheesecake, pink flowers everywhere, cinematic lighting, food photography | | |
| Darth Vader in a beautiful field of flowers, colorful flowers everywhere, perfect lighting | | |
| A micro-tiny clay pot full of dirt with a beautiful daisy planted in it, shining in the autumn sun | | |
| A raccoon family having a nice meal, life-like | | |
Comparison to other methods
Overview of the capabilities supported by recent methods for controlled image/video generation. Ctrl-Adapter supports diverse capabilities, including image control, video control, video control with sparse frames, multi-condition control, and compatibility with different backbone models, while previous methods support only a small subset of them.
Skipping the latent from ControlNet inputs: robust adaptation to different noise scales & sparse frame conditions.
Although the original ControlNets take the latent as part of their inputs, we find that skipping the latent from ControlNet inputs is effective for Ctrl-Adapter when (1) adapting to backbone diffusion models with different noise scales and (2) generating videos with sparse frame conditions (i.e., conditions are provided only for a subset of video frames).
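A toy NumPy example may help show why skipping the latent makes the ControlNet input insensitive to the backbone's noise scale. It is a simplification under stated assumptions: the latent and the condition embedding are treated as already projected to the same shape and combined by addition, and the function name `controlnet_first_layer_input` and the two scale factors are hypothetical.

```python
import numpy as np

def controlnet_first_layer_input(latent, cond_embedding, skip_latent):
    """Toy stand-in for the features entering a ControlNet encoder.

    Assumption: the projected latent and the condition embedding are summed.
    With skip_latent=True only the condition embedding is used.
    """
    if skip_latent:
        return cond_embedding
    return latent + cond_embedding

rng = np.random.default_rng(0)
cond_embedding = rng.normal(size=(4, 8, 8))  # e.g. an embedded depth map
noise = rng.normal(size=(4, 8, 8))

# Two hypothetical backbones whose noisy latents live at different scales.
latent_small = 1.0 * noise
latent_large = 20.0 * noise

with_latent_a = controlnet_first_layer_input(latent_small, cond_embedding, skip_latent=False)
with_latent_b = controlnet_first_layer_input(latent_large, cond_embedding, skip_latent=False)
skipped_a = controlnet_first_layer_input(latent_small, cond_embedding, skip_latent=True)
skipped_b = controlnet_first_layer_input(latent_large, cond_embedding, skip_latent=True)

# Mean absolute gap between the two backbones' ControlNet inputs.
gap_with_latent = np.abs(with_latent_a - with_latent_b).mean()
gap_skipped = np.abs(skipped_a - skipped_b).mean()
```

When the latent is included, the ControlNet sees inputs that shift with the backbone's noise scale; when it is skipped, both backbones present exactly the same condition-only input.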
Video Generation from Multiple Conditions
For more effective spatial control beyond a single condition, we can easily combine the control features of multiple ControlNets via Ctrl-Adapter. For this, we propose a lightweight mixture-of-experts (MoE) router that takes patch-level inputs and assigns weights to each condition. In our experiments, we find that our MoE router is more effective than equal weights or unconditional global weights.
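The idea of patch-level weighting can be sketched in NumPy as follows. This is only an illustration, not the router described in the paper: it assumes the router scores each patch with a single linear vector applied to that patch's condition feature and normalizes the scores with a softmax over conditions. The names `patch_level_fusion` and `router_weight` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_level_fusion(cond_feats, router_weight):
    """Fuse K condition feature maps with per-patch weights.

    cond_feats:    (K, P, C) features from K ControlNets over P patches.
    router_weight: (C,) hypothetical linear scoring vector.
    Returns the fused features (P, C) and the weights (K, P).
    """
    logits = cond_feats @ router_weight                    # (K, P)
    weights = softmax(logits, axis=0)                      # sums to 1 over conditions
    fused = (weights[..., None] * cond_feats).sum(axis=0)  # (P, C)
    return fused, weights

rng = np.random.default_rng(0)
cond_feats = rng.normal(size=(3, 16, 8))  # e.g. depth, canny edge, human pose
router_weight = rng.normal(size=8)

fused, weights = patch_level_fusion(cond_feats, router_weight)

# Equal-weight baseline: every condition contributes 1/K at every patch.
equal_fused = cond_feats.mean(axis=0)

# A router with all-zero weights reduces to the equal-weight baseline.
zero_fused, zero_weights = patch_level_fusion(cond_feats, np.zeros(8))
```

Unlike equal or global weights, the per-patch weights here can favor different conditions in different image regions, for example one condition on the subject and another on the background.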
Evaluation on Video Control and Image Control with Single Condition
Left: Evaluation on video control with a single condition on DAVIS 2017 dataset. Right: Evaluation on image control with a single condition on COCO dataset. We demonstrate that Ctrl-Adapter matches the performance of a pretrained image ControlNet and outperforms previous methods in controllable video generation (achieving state-of-the-art performance on the DAVIS 2017 dataset) with significantly lower computational costs (Ctrl-Adapter outperforms baselines in less than 10 GPU hours).
Evaluation on Video Control with Multiple Conditions
More conditions improve spatial control in video generation. The proposed patch-level weighting method outperforms the equal-weight approach and unconditional global weighting. The control sources are abbreviated as D (depth map), C (canny edge), N (surface normal), S (softedge), Seg (semantic segmentation map), L (line art), and P (human pose).
Training Efficiency of Ctrl-Adapter
Training speed of Ctrl-Adapter for video (left) and image (right) control with depth map. The training GPU hours are measured with A100 80GB GPUs. For both video and image controls, Ctrl-Adapter outperforms strong baselines in less than 10 GPU hours.
Ablation Study - Skipping Latents from ControlNet Inputs
We find that skipping the latents from ControlNet inputs helps Ctrl-Adapter for (1) adaptation to backbone models with different noise scales and (2) video control with sparse frame conditions.
Limitations
Our framework is primarily for research purposes (and therefore should be used with caution in real-world applications).
Note that the performance/quality/visual artifacts of Ctrl-Adapter largely depend on the capabilities (e.g., motion styles and video length) of the current open-source backbone video/image diffusion models used.
BibTeX
@article{Lin2024CtrlAdapter,
author = {Han Lin and Jaemin Cho and Abhay Zala and Mohit Bansal},
title = {Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model},
year = {2024},
}