ControlNets are widely used to add spatial control to image generation with different conditions, such as depth maps, canny edges, and human poses. However, leveraging pretrained image ControlNets for controlled video generation poses several challenges. First, a pretrained ControlNet cannot be directly plugged into a new backbone model due to the mismatch of feature spaces, and the cost of training ControlNets for new backbones is a significant burden for many users. Second, ControlNet features are computed independently for each frame and thus may not effectively preserve the temporal consistency of objects across a video.
To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion model by adapting pretrained ControlNets (and improving temporal alignment for videos). Ctrl-Adapter provides strong and diverse capabilities, including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbone models, adaptation to unseen control conditions, and video editing. In the Ctrl-Adapter framework, we train adapter layers that fuse pretrained ControlNet features into different image/video diffusion models, while keeping the parameters of both the ControlNets and the diffusion models frozen. Ctrl-Adapter consists of temporal as well as spatial modules, so it can effectively handle the temporal consistency of videos. Additionally, for robust adaptation to different backbone models and to sparse controls, we propose latent skipping and inverse timestep sampling. Moreover, Ctrl-Adapter enables control from multiple conditions by simply taking the (weighted) average of ControlNet outputs.
In experiments with diverse image and video diffusion backbones (SDXL, Hotshot-XL, I2VGen-XL, and SVD), Ctrl-Adapter matches ControlNet for image control on the COCO dataset and outperforms all baselines for video control, achieving state-of-the-art accuracy on the DAVIS 2017 dataset, with significantly lower computational cost (Ctrl-Adapter surpasses the baselines in less than 10 GPU hours). Lastly, we provide comprehensive ablations of our design choices and qualitative examples.
Efficient Adaptation of Pretrained ControlNets. As shown in the left figure, we train an adapter module (colored orange) to map the middle/output blocks of a pretrained ControlNet (colored blue) to the corresponding middle/output blocks of the target video diffusion model (colored green). We keep all parameters in both the ControlNet and the target video diffusion model frozen. Therefore, training a Ctrl-Adapter can be significantly more efficient than training a new video ControlNet.
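As a rough sketch of this setup (the loader functions and the CtrlAdapter class below are hypothetical placeholders, not the official API), only the adapter's parameters receive gradients:

```python
import torch

# Hypothetical loaders for illustration; names and signatures are
# assumptions, not the official Ctrl-Adapter API.
controlnet = load_pretrained_controlnet()    # frozen (blue in the figure)
backbone = load_video_diffusion_backbone()   # frozen (green in the figure)
adapter = CtrlAdapter(controlnet, backbone)  # trained (orange in the figure)

# Freeze the ControlNet and the target diffusion model.
controlnet.requires_grad_(False)
backbone.requires_grad_(False)

# Only the adapter layers are optimized.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```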
Ctrl-Adapter architecture. As shown in the right figure, each block of Ctrl-Adapter consists of four modules: spatial convolution, temporal convolution, spatial attention, and temporal attention. The temporal convolution and attention modules effectively fuse ControlNet features from different frames for better temporal consistency. When adapting to image diffusion models, Ctrl-Adapter blocks consist only of the spatial convolution/attention modules (without temporal convolution/attention modules).
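A minimal PyTorch sketch of one such block is below. It assumes ControlNet features shaped (batch, frames, channels, height, width); the exact module ordering, normalization, and residual connections are our assumptions rather than the official design.

```python
import torch
import torch.nn as nn

class CtrlAdapterBlock(nn.Module):
    """Sketch of one Ctrl-Adapter block: spatial conv, temporal conv,
    spatial attention, temporal attention. Ordering and residual
    structure are assumptions; channels must be divisible by num_heads."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.temporal_conv = nn.Conv1d(channels, channels, 3, padding=1)
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, c, h, w = x.shape
        # Spatial convolution on each frame independently.
        x = self.spatial_conv(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        # Temporal convolution along the frame axis at each spatial location.
        xt = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal_conv(xt).reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        # Spatial self-attention within each frame (tokens = pixels).
        xs = x.reshape(b * f, c, h * w).transpose(1, 2)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        x = xs.transpose(1, 2).reshape(b, f, c, h, w)
        # Temporal self-attention across frames (tokens = frames).
        xt = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]
        return xt.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

block = CtrlAdapterBlock(channels=64, num_heads=8)
features = torch.randn(1, 16, 64, 8, 8)   # 16 frames of 8x8 feature maps
print(block(features).shape)              # torch.Size([1, 16, 64, 8, 8])
```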
*[Video gallery: video control with a single condition. Each row shows the control condition (left) and the generated video (right).]*
*[Video gallery: multi-condition video control. Each row shows the combined control conditions (left) and the generated video (right).]*
*[Video editing gallery: (1) control conditions are extracted from the source video, (2) a frame is generated with SDXL + Ctrl-Adapter from the input prompt, and (3) a video is generated with I2VGen-XL + Ctrl-Adapter. Input prompts:]*

- A camel with rainbow fur walking.
- A zebra-striped camel walking.
- A camel walking, ink sketch style.
- A camel walking, van Gogh style.
*[Video gallery: video control with sparse frame conditions. The control condition is given for only 4 of 16 frames (left); the generated video is shown on the right.]*
*[Video gallery: zero-shot adaptation to unseen control conditions. Ctrl-Adapter is trained with depth maps and tested at inference with normal maps, line art, and softedge conditions. Each row shows the condition setup, the controls, and the generated video.]*
*[Image gallery: image control examples. Each row shows the prompt, the control condition, and the generated image. Prompts:]*

- Cute fluffy corgi dog in the city in anime style
- happy Hulk standing in a beautiful field of flowers, colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, Kodak Portra 400, film grain
- Astronaut walking on water
- a cute mouse pilot wearing aviator goggles, unreal engine render, 8k
- Cute lady frog in dress and crown dressed in gown in cinematic environment
- A cute sheep with rainbow fur, photo
- Cute and super adorable mouse in black and red chef coat and chef hat, holding a steaming entree
- a cute, happy hedgehog taking a bite from a piece of watermelon, eyes closed, cute ink sketch style illustration
Overview of the capabilities supported by recent methods for controlled image/video generation. Ctrl-Adapter supports diverse capabilities including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbone models, adaptation to unseen control conditions, and video editing, while previous methods support only some of them.
Although the original ControlNets take the latent as part of their inputs, we find that skipping the latent from the ControlNet inputs is effective for Ctrl-Adapter when (1) adapting to backbone diffusion models with different noise scales and (2) generating videos with sparse frame conditions (i.e., conditions are provided for only a subset of the video frames).
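As a sketch of how this could look in code (the forward signature below is an illustrative assumption, not the official API), the noisy latent fed to the ControlNet is simply replaced with zeros, so the ControlNet's features depend only on the control condition:

```python
import torch

def controlnet_features(controlnet, z_t, t, cond, skip_latent=True):
    """Compute ControlNet features, optionally skipping the latent input.

    `controlnet` is assumed to take (latent, timestep, condition); this
    signature is a hypothetical placeholder. Zeroing the latent removes
    the dependence on the backbone's noise scale and on frames that have
    no control condition.
    """
    latent_in = torch.zeros_like(z_t) if skip_latent else z_t
    return controlnet(latent_in, t, cond)
```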
For more effective spatial control beyond a single condition, we can easily combine the control features of multiple ControlNets via Ctrl-Adapter. We first average the ControlNet output features using learnable scalar weights for each Ctrl-Adapter block, then feed the fused ControlNet features as input to a single, shared Ctrl-Adapter. This weighted-average approach can be understood as a lightweight mixture-of-experts (MoE).
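A sketch of this weighted fusion is below; the softmax normalization and the per-block weight parameterization are our assumptions:

```python
import torch
import torch.nn as nn

class ControlNetFusion(nn.Module):
    """Sketch of multi-condition fusion: one learnable scalar weight per
    ControlNet per Ctrl-Adapter block. Softmax normalization (an assumption)
    makes the fused features a convex combination, i.e., a lightweight MoE."""

    def __init__(self, num_controlnets: int, num_blocks: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_blocks, num_controlnets))

    def forward(self, features: list[torch.Tensor], block_idx: int) -> torch.Tensor:
        # `features`: one tensor per ControlNet, all with identical shapes.
        weights = torch.softmax(self.logits[block_idx], dim=0)
        stacked = torch.stack(features, dim=0)                    # (K, ...)
        weights = weights.view(-1, *([1] * (stacked.dim() - 1)))  # broadcast
        return (weights * stacked).sum(dim=0)                     # fused features
```

The fused features are then passed to the single shared Ctrl-Adapter in place of a single ControlNet's output.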
Left: Evaluation of video control with a single condition on the DAVIS 2017 dataset. Right: Evaluation of image control with a single condition on the COCO dataset. Ctrl-Adapter matches the performance of a pretrained image ControlNet and outperforms previous methods in controllable video generation (achieving state-of-the-art performance on the DAVIS 2017 dataset) with significantly lower computational costs (surpassing the baselines in less than 10 GPU hours).
More conditions improve spatial control in video generation. The proposed linear-weight approach outperforms equal weighting as the number of conditions increases. The control sources are abbreviated as D (depth map), C (canny edge), N (surface normal), S (softedge), Seg (semantic segmentation map), L (line art), and P (human pose).
Training speed of Ctrl-Adapter for video (left) and image (right) control with depth map. The training GPU hours are measured with A100 80GB GPUs. For both video and image controls, Ctrl-Adapter outperforms strong baselines in less than 10 GPU hours.
We find that skipping the latents from ControlNet inputs helps Ctrl-Adapter for (1) adaptation to backbone models with different noise scales and (2) video control with sparse frame conditions.
Our framework is primarily for research purposes (and therefore should be used with caution in real-world applications).
Note that the performance, quality, and visual artifacts of Ctrl-Adapter largely depend on the capabilities (e.g., motion styles and video length) of the underlying open-source image/video diffusion backbones.
@article{Lin2024CtrlAdapter,
author = {Han Lin and Jaemin Cho and Abhay Zala and Mohit Bansal},
title = {Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model},
year = {2024},
}