- FastCache-xDiT
- QuickStart
- Supported DiTs
- Performance Comparison
- Technical Details
- xDiT's Parallel Methods
- Single GPU Acceleration
- Develop Guide
- Cite Us
FastCache-xDiT is a novel plug-and-play acceleration method for Diffusion Transformers (DiTs) that exploits computational redundancies across both spatial and temporal dimensions. With zero training required and minimal quality impact, FastCache can deliver significant speedups (up to 1.7x) on modern DiT models while being fully compatible with existing parallel inference methods.
- Plug-and-Play: Drop-in acceleration with no model modifications required
- Adaptive Computation: Dynamically adjusts caching behavior based on model hidden states
- Spatial-Temporal Awareness: Intelligently identifies redundant computations in both dimensions
- Memory Efficient: Reduces peak memory usage by avoiding unnecessary computations
- Compatible with Parallel Methods: Can be combined with USP, PipeFusion, and other xDiT parallel techniques
FastCache introduces a hidden-state-level caching and compression framework with two core components:
- Spatial Token Reduction Module - Adaptively identifies and processes only tokens with significant changes
- Transformer-Level Caching Module - Uses statistical tests to determine when entire transformer blocks can be skipped
FastCache delivers significant speedups across popular DiT models:
| Model | Baseline | FastCache | TeaCache | First-Block-Cache |
|---|---|---|---|---|
| Flux.1 | 9.8s | 6.2s (1.6x) | 7.1s (1.4x) | 7.5s (1.3x) |
| PixArt Sigma | 10.6s | 6.7s (1.6x) | 6.9s (1.5x) | 6.8s (1.6x) |
FastCache-xDiT operates on two levels, using learnable parameters to approximate redundant computations:
FastCache computes a motion-aware saliency metric by comparing hidden states between timesteps:

$$S_t = \|H_t - H_{t-1}\|_2^2$$

Each token is classified as either *motion* or *static* by comparing its saliency against a threshold $\tau_s$: motion tokens receive full transformer processing, while static tokens reuse cached computation. This spatial token reduction significantly reduces computation by applying full transformer processing only to tokens with significant changes.
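Below is a minimal PyTorch sketch of this partition step (the function name `partition_tokens` and the `[batch, tokens, dim]` tensor layout are illustrative assumptions, not the repo's actual implementation):

```python
import torch

def partition_tokens(h_t: torch.Tensor, h_prev: torch.Tensor,
                     motion_threshold: float = 0.1) -> torch.Tensor:
    """Split tokens into motion/static sets via per-token saliency.

    h_t, h_prev: hidden states of shape [batch, tokens, dim].
    Returns a boolean mask (True = motion token) of shape [batch, tokens].
    """
    # Token-wise squared L2 distance between consecutive timesteps
    saliency = (h_t - h_prev).pow(2).sum(dim=-1)
    # Tokens whose saliency exceeds the threshold are treated as motion tokens
    return saliency > motion_threshold
```

Motion tokens then pass through the full transformer blocks, while static tokens reuse cached values.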
For each transformer block, FastCache computes a relative change metric between current and previous hidden states:

$$\delta_{t,l} = \frac{\|H_{t,l-1} - H_{t-1,l-1}\|_F}{\|H_{t-1,l-1}\|_F}$$

Under statistical assumptions, this metric follows a scaled chi-square distribution: $ND\,\delta_{t,l}^2 \sim \chi^2_{ND}$, where $N$ is the number of tokens and $D$ the hidden dimension. FastCache applies a cache decision rule: for confidence level $1-\alpha$, the block is skipped whenever

$$\delta_{t,l}^2 \leq \frac{\chi^2_{ND,1-\alpha}}{ND}$$

Instead of computing the full transformer block, FastCache uses a block-specific learnable linear projection:

$$H_{t,l} = W_l H_{t,l-1} + b_l$$

This provides a statistically sound method to decide when hidden states can be reused, while the learnable parameters ensure output quality is maintained.
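The decision rule is straightforward to express in code. Here is a hedged sketch (the helper name `should_skip_block` is ours; `scipy.stats.chi2.ppf` supplies the chi-square quantile):

```python
import torch
from scipy.stats import chi2

def should_skip_block(h_cur: torch.Tensor, h_prev: torch.Tensor,
                      alpha: float = 0.05) -> bool:
    """Return True when the relative change is small enough that the block
    can be replaced by its learnable linear approximation."""
    nd = h_cur.numel()  # N tokens * D hidden dims
    # Relative Frobenius-norm change between consecutive timesteps
    delta = torch.linalg.norm(h_cur - h_prev) / torch.linalg.norm(h_prev)
    # Skip when delta^2 <= chi^2_{ND, 1-alpha} / ND
    return delta.item() ** 2 <= chi2.ppf(1.0 - alpha, df=nd) / nd
```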
FastCache includes an adaptive thresholding mechanism that adjusts the cache threshold based on the denoising timestep and the variance of the hidden states, so the skip criterion is stricter while hidden states are still changing rapidly and more permissive once they stabilize.
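The exact schedule is defined in the paper; purely as an illustration of the idea, a threshold of the following shape (all names and coefficients here are hypothetical placeholders) loosens the criterion as denoising progresses and tightens it when hidden-state variance is high:

```python
def adaptive_threshold(step: int, num_steps: int, hidden_variance: float,
                       base: float = 0.05, beta_t: float = 0.02,
                       beta_v: float = 0.01) -> float:
    """Illustrative-only schedule; coefficients are placeholders,
    not values from the FastCache paper."""
    progress = step / max(num_steps, 1)
    return base * (1.0 + beta_t * progress) / (1.0 + beta_v * hidden_variance)
```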
```
Algorithm: FastCache
Input:  Hidden state H_t, previous hidden state H_{t-1},
        transformer blocks {Block_l}, thresholds τ_s, α
Output: Processed hidden state H_t^L

1. Compute token-wise saliency S_t ← ||H_t - H_{t-1}||_2^2
2. Partition tokens into motion tokens X_t^m and static tokens X_t^s based on τ_s
3. Initialize H_{t,0} ← Concat(X_t^m, X_t^s)
4. For l = 1 to L:
   a. δ_{t,l} ← ||H_{t,l-1} - H_{t-1,l-1}||_F / ||H_{t-1,l-1}||_F
   b. If δ_{t,l}^2 ≤ χ^2_{ND,1-α} / ND:
      i. H_{t,l} ← W_l H_{t,l-1} + b_l    (linear approximation)
   c. Else:
      i. H_{t,l} ← Block_l(H_{t,l-1})     (full computation)
5. Return H_t^L
```
This approach provides significant speedups (up to 1.7x) with minimal impact on generation quality by intelligently skipping redundant computations at both the token and transformer block levels.
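Putting both levels together, a compact PyTorch sketch of the per-block loop might look as follows (the class name `FastCacheBlockWrapper` is illustrative; the repo's real logic lives behind `xFuserFastCachePipelineWrapper`):

```python
import torch
from torch import nn
from scipy.stats import chi2

class FastCacheBlockWrapper(nn.Module):
    """Wrap one transformer block with a learnable linear shortcut (W_l, b_l)."""

    def __init__(self, block: nn.Module, dim: int, alpha: float = 0.05):
        super().__init__()
        self.block = block
        self.proj = nn.Linear(dim, dim)  # W_l, b_l from the algorithm
        self.alpha = alpha
        self.prev_input = None           # cached H_{t-1, l-1}

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        skip = False
        if self.prev_input is not None and self.prev_input.shape == h.shape:
            nd = h.numel()
            delta = torch.linalg.norm(h - self.prev_input) / torch.linalg.norm(self.prev_input)
            skip = delta.item() ** 2 <= chi2.ppf(1.0 - self.alpha, df=nd) / nd
        self.prev_input = h.detach()
        # Cheap linear approximation when the change is statistically
        # negligible, full block computation otherwise
        return self.proj(h) if skip else self.block(h)
```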
```bash
pip install xfuser                          # basic installation
pip install "xfuser[diffusers,flash-attn]"  # with diffusers and flash attention
```
```python
from xfuser.model_executor.pipelines.fastcache_pipeline import xFuserFastCachePipelineWrapper
from diffusers import PixArtSigmaPipeline

# Load your diffusion model
model = PixArtSigmaPipeline.from_pretrained("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS")

# Create FastCache wrapper
fastcache_wrapper = xFuserFastCachePipelineWrapper(model)

# Enable FastCache with optional parameters
fastcache_wrapper.enable_fastcache(
    cache_ratio_threshold=0.05,  # relative change threshold for caching
    motion_threshold=0.1,        # threshold for motion saliency
)

# Run inference with FastCache acceleration
result = fastcache_wrapper(
    prompt="a photo of an astronaut riding a horse on the moon",
    num_inference_steps=30,
)

# Get cache statistics
stats = fastcache_wrapper.get_cache_statistics()
print(stats)
```
Run FastCache with PixArt Sigma:
```bash
# Basic usage
python examples/run_fastcache_test.py \
    --model_type pixart \
    --model "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS" \
    --prompt "a photo of an astronaut riding a horse on the moon" \
    --num_inference_steps 30 \
    --cache_method "Fast" \
    --cache_ratio_threshold 0.05 \
    --motion_threshold 0.1

# Compare different cache methods with the convenience benchmark script
./examples/run_fastcache_benchmark.sh pixart
```
Run FastCache with Flux model:
```bash
# Basic usage
python examples/run_fastcache_test.py \
    --model_type flux \
    --model "black-forest-labs/FLUX.1-schnell" \
    --prompt "a serene landscape with mountains and a lake" \
    --num_inference_steps 30 \
    --cache_method "Fast" \
    --cache_ratio_threshold 0.05 \
    --motion_threshold 0.1

# Using the convenience benchmark script
./examples/run_fastcache_benchmark.sh flux
```
To benchmark several cache methods in one run:

```bash
python benchmark/cache_execute.py \
    --model_type pixart \
    --cache_methods None Fast Fb Tea \
    --num_inference_steps 20 \
    --height 512 \
    --width 512 \
    --output_dir cache_results

# Quick low-resolution test with the xfuser-integrated benchmark
python benchmark/cache_execution_xfuser.py \
    --model_type pixart \
    --cache_methods Fast \
    --num_inference_steps 5 \
    --height 256 \
    --width 256 \
    --output_dir test_results
```
| Argument | Description | Default |
|---|---|---|
| `--model_type` | Model type (`pixart`, `flux`) | `pixart` |
| `--model` | Model path or name | `PixArt-alpha/PixArt-Sigma-XL-2-1024-MS` |
| `--prompt` | Text prompt for image generation | `a photo of an astronaut riding a horse on the moon` |
| `--num_inference_steps` | Number of inference steps | `30` |
| `--cache_method` | Cache method (`None`, `Fast`, `Fb`, `Tea`) | `Fast` |
| `--seed` | Random seed | `42` |
| `--height` | Image height | `768` |
| `--width` | Image width | `768` |
| `--cache_ratio_threshold` | Cache ratio threshold | `0.05` |
| `--motion_threshold` | FastCache motion threshold | `0.1` |
| `--output_dir` | Output directory for results | `fastcache_test_results` |
Compare FastCache with other acceleration methods:
```bash
# Run on PixArt Sigma
./examples/run_fastcache_benchmark.sh pixart

# Run on Flux model
./examples/run_fastcache_benchmark.sh flux
```
The benchmark will:
- Run the baseline model without acceleration (`cache_method="None"`)
- Run with FastCache acceleration (`cache_method="Fast"`)
- Run with First-Block-Cache acceleration (`cache_method="Fb"`)
- Run with TeaCache acceleration (`cache_method="Tea"`)
- Generate comparison images for quality assessment
- Create performance statistics and cache hit ratio charts
- Generate a comprehensive HTML report with all comparisons
All results will be saved to the `fastcache_benchmark_results` directory, making it easy to compare the different caching methods in terms of both performance and output quality.
FastCache can be combined with xDiT's parallel methods for even greater speedups:
```python
from xfuser import xFuserArgs
from xfuser.parallel import xDiTParallel

# Enable FastCache in your pipeline
fastcache_wrapper = xFuserFastCachePipelineWrapper(model)
fastcache_wrapper.enable_fastcache()

# Apply parallel inference (USP, PipeFusion, etc.);
# engine_args comes from xFuserArgs (see the sketch below)
engine_config, input_config = engine_args.create_config()
paralleler = xDiTParallel(fastcache_wrapper, engine_config, input_config)

# Run inference with both FastCache and parallel acceleration
result = paralleler(prompt="your prompt", num_inference_steps=30)
```
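For completeness, here is a hedged sketch of how `engine_config` and `input_config` are typically obtained via xDiT's `xFuserArgs` CLI helper (the flag values are examples only):

```python
import argparse
from xfuser import xFuserArgs

parser = argparse.ArgumentParser()
xFuserArgs.add_cli_args(parser)  # registers xDiT's model/parallelism flags
args = parser.parse_args([
    "--model", "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    "--ulysses_degree", "2",     # example: 2-way sequence parallelism (USP)
])
engine_args = xFuserArgs.from_cli_args(args)
engine_config, input_config = engine_args.create_config()
```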
| Model Name | FastCache | CFG | SP | PipeFusion | Performance Report Link |
|---|---|---|---|---|---|
| Flux | ✔️ | NA | ✔️ | ✔️ | Report |
| PixArt-Sigma | ✔️ | ✔️ | ✔️ | ✔️ | Report |
| HunyuanDiT-v1.2-Diffusers | ✔️ | ✔️ | ✔️ | ✔️ | Report |
| StepVideo | ✔️ | NA | ✔️ | ❌ | Report |
| PixArt-alpha | ✔️ | ✔️ | ✔️ | ✔️ | Report |
| ConsisID-Preview | ✔️ | ✔️ | ✔️ | ❌ | Report |
| CogVideoX1.5 | ✔️ | ✔️ | ✔️ | ❌ | Report |
FastCache-xDiT is fully compatible with the parallel acceleration methods provided by xDiT. In addition, this repo offers multiple cache-based acceleration methods:
- **FastCache**: Our adaptive spatial-temporal caching method, which uses motion-aware token reduction and statistical caching to exploit computational redundancies. Read more about FastCache.
- **TeaCache**: Memory-friendly caching that exploits redundancies between adjacent denoising steps.
- **First-Block-Cache**: Caches the output of early transformer blocks across timesteps.
FastCache also works with video generation models. Here are examples of how to use it with different video DiT models:
```bash
# Using FastCache with StepVideo
python benchmark/video_cache_execute.py \
    --model_type stepvideo \
    --prompt "a dog running in a field" \
    --cache_methods Fast \
    --num_inference_steps 20 \
    --num_frames 16 \
    --height 256 \
    --width 256 \
    --cache_ratio_threshold 0.15

# Using FastCache with CogVideoX
python benchmark/video_cache_execute.py \
    --model_type cogvideox \
    --prompt "a dog running in a field" \
    --cache_methods Fast \
    --num_inference_steps 20 \
    --num_frames 16

# Using FastCache with ConsisID
python benchmark/video_cache_execute.py \
    --model_type consisid \
    --prompt "a time lapse of a blooming flower" \
    --cache_methods Fast \
    --num_inference_steps 20 \
    --num_frames 16

# Compare all cache methods (None, Fast, Fb, Tea) on video generation
python benchmark/video_cache_execute.py \
    --model_type stepvideo \
    --cache_methods All \
    --num_frames 16 \
    --num_inference_steps 20
```
We provide a step-by-step guide for adding new models; please refer to the following tutorial.
If you use FastCache-xDiT in your research or applications, please cite:
```bibtex
@inproceedings{liu2025fastcachecvpr,
  title={FastCache: Cache What Matters, Skip What Doesn't},
  author={Liu, Dong and Zhang, Jiayi and Li, Yifan and Yu, Yanxuan and Lengerich, Ben and Wu, Ying Nian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year={2025}
}

@article{liu2025fastcache,
  title={FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation},
  author={Liu, Dong and Zhang, Jiayi and Li, Yifan and Yu, Yanxuan and Lengerich, Ben and Wu, Ying Nian},
  journal={arXiv preprint arXiv:2505.20353},
  year={2025}
}
```
For questions about FastCache-xDiT, please contact dong.liu.dl2367@yale.edu.