♻️ Cache Me if You Can: Accelerating Diffusion Models through Block Caching
Work done during Felix's internship at Meta GenAI
Abstract
Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce Block Caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation, and qualitative analysis that Block Caching enables generating images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
Method
Analysis
We observe that in diffusion models, not only the intermediate results x but also the internal feature maps change smoothly over time. (a) We visualize output feature maps of two layer blocks within the denoising network via PCA. Structures change smoothly, at different rates. (b) We also observe this smooth layer-wise change when plotting the change in output from one step to the next, averaged over many different prompts and randomly initialized noise. Besides the average, we also show the standard deviation as a shaded area. The patterns always remain the same. (Configuration: LDM-512, DPM, 20 steps.)
We make three key observations:
- Smooth change over time. Similarly to the intermediate images during denoising, the blocks change smoothly and gradually over time. This suggests that there is a clear temporal relation between the outputs of a block.
- Distinct patterns of change. The different blocks do not behave uniformly over time. Rather, they change substantially during certain periods of the denoising process while remaining nearly inactive in others. The standard deviation shows that this behavior is consistent across different images and random seeds. Note that some blocks, for example the blocks at higher resolutions (either very early or very late in the network), change most in the last 20% of the denoising process, while deeper blocks at lower resolutions change more in the beginning.
- Small step-to-step difference. Almost every block has significant periods during the denoising process in which its output changes only very little (we sketch such a change metric below).
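As a concrete illustration, the change plotted above can be measured as a relative L1 distance between a block's outputs at consecutive denoising steps. The following is a minimal PyTorch sketch of such a metric; the function name relative_l1_change is our own illustration, not the released code.

import torch

def relative_l1_change(prev_out: torch.Tensor, curr_out: torch.Tensor) -> float:
    # Relative L1 change of a block's output between two consecutive
    # denoising steps. Values near zero indicate that the block's output
    # is nearly static, so a cached value would be a good substitute.
    num = (curr_out - prev_out).abs().sum()
    den = curr_out.abs().sum().clamp_min(1e-8)  # guard against division by zero
    return (num / den).item()

Averaging these per-step values over many prompts and random seeds yields stable per-block change curves like those shown above.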
Block Caching
We hypothesize that many layer blocks perform redundant computations during steps in which their outputs change very little. To reduce these redundant computations and to speed up inference, we propose Block Caching.
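A minimal sketch of the mechanism, assuming the denoising network exposes its layer blocks as modules and the sampling loop passes the current step index through. CachedBlock and recompute_steps are illustrative names for this sketch, not the released implementation.

import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    # Wraps one layer block of the denoising network. At steps contained
    # in recompute_steps the wrapped block runs normally and its output is
    # stored; at all other steps the stored output is reused as-is.

    def __init__(self, block: nn.Module, recompute_steps):
        super().__init__()
        self.block = block
        self.recompute_steps = recompute_steps
        self.cache = None

    def forward(self, x, step):
        if self.cache is None or step in self.recompute_steps:
            self.cache = self.block(x)
        return self.cache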
Automatic Cache Schedule
Not every block should be cached all the time. To make a more informed decision about when and where to cache, we rely on the change metrics visualized above. Our intuition is that for any layer block i, we retain a cached value, computed at time step t_a, as long as the accumulated change does not exceed a certain threshold δ. Once the threshold is exceeded at time step t_b, we recompute the block's output.
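Sketched in code, the schedule for one block can be derived offline from its averaged change curve: accumulate the per-step change since the last recomputation and trigger a recompute whenever the sum crosses δ. cache_schedule is a hypothetical helper implementing exactly this rule.

def cache_schedule(change_per_step, delta):
    # Return the set of steps at which the block must be recomputed.
    # Between two recompute steps t_a and t_b, the value cached at t_a is
    # reused because the accumulated change stays at or below delta.
    recompute = {0}  # the first step always computes the block
    accumulated = 0.0
    for t in range(1, len(change_per_step)):
        accumulated += change_per_step[t]
        if accumulated > delta:
            recompute.add(t)
            accumulated = 0.0
    return recompute

A single threshold δ thus yields a different schedule for every block: blocks with flat change curves are recomputed rarely, while highly active blocks are recomputed at almost every step.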
Scale Shift Adjustment
To enable the model to adjust to using cached values, we introduce a very lightweight scale-shift adjustment mechanism wherever we apply caching. To this end, we add a timestep-dependent scalar shift and scale parameter for each layer that receives a cached input.
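One possible realization, again only a sketch: a learnable scalar scale and shift per denoising step, applied to the cached feature map before it is passed on. The module name ScaleShift and the surrounding training setup are assumptions for illustration.

import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    # Timestep-dependent affine adjustment for cached values: one scalar
    # scale and one scalar shift per denoising step, i.e. only
    # 2 * num_steps extra parameters per adjusted layer.

    def __init__(self, num_steps):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_steps))
        self.shift = nn.Parameter(torch.zeros(num_steps))

    def forward(self, cached, step):
        return self.scale[step] * cached + self.shift[step]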
Results
EMU + Caching
Given a fixed computational budget, we can perform more denoising steps and obtain higher-quality results. Here, we compare EMU with our caching approach at 20 steps against the default setup at 14 steps. At identical inference speed, our caching technique produces finer details and more vibrant colors.
Example prompts:
- A magical portal opening to reveal a hidden realm of wonders.
- A tranquil garden with cherry blossoms in full bloom under a full moon.
- An ancient castle on a cliff overlooking a vast, mist-covered valley.
- A yellow tiger with blue stripes.
- A time-traveling wizard riding a mechanical steed through a portal, leaving trails of stardust in their wake.
- A floating city in the clouds where airships navigate through tunnels of light, and majestic creatures soar in the skies.
Quantitative Results
We conduct a human evaluation study on the visual appeal of images generated with the caching configuration versus the baseline without caching. We always compare configurations with the same latency.
LDM + Caching
We show different configurations for the common LDM architecture. The caching configurations at 20 steps and the baseline configuration at 14 steps have the same latency. The baseline with 20 steps is about 1.5x slower. Our method often provides richer colors and finer details. Through our scale-shift adjustment, we avoid artifacts that are visible when naively applying block caching.
Quantitative Results
For different solvers, we test our caching technique against baselines with 1) the same number of steps or 2) the same latency. In all cases, our proposed approach achieves a significant speedup while improving visual quality, as measured by FID on a COCO subset with all faces removed (for privacy reasons). Legend: SS = scale-shift adjustment, Img/s. = images per second.
BibTeX
@article{wimbauer2023cache,
  title={Cache Me if You Can: Accelerating Diffusion Models through Block Caching},
  author={Wimbauer, Felix and Wu, Bichen and Schoenfeld, Edgar and Dai, Xiaoliang and Hou, Ji and He, Zijian and Sanakoyeu, Artsiom and Zhang, Peizhao and Tsai, Sam and Kohler, Jonas and others},
  journal={arXiv preprint arXiv:2312.03209},
  year={2023}
}