PixelDiT
Pixel Diffusion Transformers
for Image Generation
Say Goodbye to VAEs
Direct Pixel Space Optimization
Latent Diffusion Models (LDMs) like Stable Diffusion rely on a Variational Autoencoder (VAE) to compress images into latents. This process is lossy.
- × Lossy Reconstruction: VAEs blur high-frequency details (text, texture).
- × Artifacts: Compression artifacts can confuse the generation process.
- × Misalignment: Two-stage training leads to objective mismatch.
Pixel Models change the game:
- ✓ End-to-End: Trained and sampled directly on pixels in a single stage (see the training sketch after this list).
- ✓ High-Fidelity Editing: Preserves details during editing.
- ✓ Simplicity: Single-stage training pipeline.
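To make "single-stage training on pixels" concrete, here is a minimal sketch of one flow-matching training step applied directly to images, with no VAE encode/decode anywhere in the loop. This is an illustration under assumptions, not PixelDiT's released code: `model` stands in for any velocity-prediction network with a `model(x_t, t, cond)` interface.

```python
import torch
import torch.nn.functional as F

def pixel_flow_matching_step(model, x1, cond):
    """One single-stage training step directly on pixels (no VAE).

    A minimal flow-matching sketch: `model` is a stand-in velocity-prediction
    network, `x1` is a batch of images scaled to [-1, 1], `cond` is the
    class/text conditioning. The signature is assumed for illustration.
    """
    b = x1.size(0)
    x0 = torch.randn_like(x1)                # pure-noise endpoint of the path
    t = torch.rand(b, device=x1.device)      # uniform timesteps in [0, 1]
    t_ = t.view(b, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1           # linear interpolation between noise and image
    v_target = x1 - x0                       # constant velocity of the linear path
    v_pred = model(xt, t, cond)              # velocity predicted in pixel space
    return F.mse_loss(v_pred, v_target)
```

Because the loss is computed on raw pixels, the training objective and the sampling space coincide, which is the "objective mismatch" that two-stage latent pipelines avoid only approximately.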
Method: Dual-Level Architecture
We introduce a Dual-Level DiT Architecture to make pixel-space diffusion efficient.
Our architecture pairs Pixel Token Compaction, which reduces the computational cost of attention over dense pixel grids, with Pixel-wise AdaLN, which conditions each pixel's update on the global semantic context; a schematic sketch follows.
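The sketch below shows one way these two ideas could compose in a single block. It is a schematic under assumptions, not the paper's exact implementation: "Pixel Token Compaction" is approximated by folding each p×p pixel neighborhood into one token before attention, and "Pixel-wise AdaLN" by predicting per-pixel scale/shift maps from the compact semantic tokens. All module names here are illustrative.

```python
import torch
import torch.nn as nn

class DualLevelBlock(nn.Module):
    """Schematic dual-level block (illustrative, not PixelDiT's exact code)."""
    def __init__(self, dim, patch=4, heads=8):
        super().__init__()
        self.patch = patch
        # Pixel Token Compaction (assumed form): p*p pixels -> 1 compact token
        self.compact = nn.Conv2d(dim, dim, patch, stride=patch)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Pixel-wise AdaLN (assumed form): per-pixel (scale, shift) from semantics
        self.to_mod = nn.ConvTranspose2d(dim, 2 * dim, patch, stride=patch)
        self.pix_norm = nn.GroupNorm(1, dim, affine=False)
        self.pix_mlp = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                                # x: (B, C, H, W) pixel features
        B, C, H, W = x.shape
        tok = self.compact(x)                            # (B, C, H/p, W/p) compact tokens
        seq = tok.flatten(2).transpose(1, 2)             # (B, N, C) for attention
        q = self.norm(seq)
        seq = seq + self.attn(q, q, q, need_weights=False)[0]   # global attention on few tokens
        sem = seq.transpose(1, 2).reshape(B, C, H // self.patch, W // self.patch)
        scale, shift = self.to_mod(sem).chunk(2, dim=1)  # upsample to per-pixel modulation maps
        # modulate every pixel's update by the global semantic context
        x = x + self.pix_mlp(self.pix_norm(x) * (1 + scale) + shift)
        return x
```

The key cost saving is that full attention runs over H·W/p² compact tokens rather than H·W pixels, while the per-pixel pathway stays cheap (normalization, modulation, and 1×1 convolutions).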
Content Consistency in Image Editing
PixelDiT edits stay faithful!
FlowEdit exposes how VAE reconstructions from FLUX can warp fine text when tracing the full flow path (n_min = 0). The comparison shows PixelDiT keeping the brick-wall lettering intact because it denoises directly in pixel space: no lossy VAE, no baked-in artifacts.
We simply plug the pretrained PixelDiT flow into FlowEdit and obtain clean local edits without re-rendering the entire scene; a schematic of this editing loop follows the example below.
Prompt edit: “A bicycle parked on the sidewalk…” → “A motorcycle parked on the sidewalk…”
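For readers unfamiliar with FlowEdit, the sketch below shows the shape of a FlowEdit-style edit driven by a pixel-space flow. It is a schematic under assumptions, not PixelDiT's or FlowEdit's released code: `model(x, t, c)` is assumed to predict the velocity of the path x_t = (1 − t)·noise + t·image (t = 0 noise, t = 1 clean), and `t_start = 0.0` corresponds to tracing the full flow path (n_min = 0). See the FlowEdit paper for the exact algorithm.

```python
import torch

@torch.no_grad()
def flowedit_pixel(model, x_src, c_src, c_tar, steps=28, t_start=0.0):
    """Schematic FlowEdit-style edit with a pixel-space flow (a sketch,
    assuming a velocity model `model(x, t, c)`; signature is illustrative)."""
    z_fe = x_src.clone()                           # edit trajectory starts at the source image
    ts = torch.linspace(t_start, 1.0, steps + 1, device=x_src.device)
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        n = torch.randn_like(x_src)                # fresh noise at every step
        z_src = (1 - t) * n + t * x_src            # noisy source state at time t
        z_tar = z_fe + (z_src - x_src)             # coupled target state
        tb = t.expand(x_src.size(0))
        # integrate only the *difference* between target- and source-conditioned velocities,
        # so unedited regions (where the two velocities agree) stay untouched
        v_delta = model(z_tar, tb, c_tar) - model(z_src, tb, c_src)
        z_fe = z_fe + dt * v_delta
    return z_fe                                    # edited image, directly in pixel space
```

Because both velocity evaluations run on pixels, there is no VAE round trip to blur fine structure such as the lettering in the example above.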
Performance
State-of-the-art on ImageNet 256×256
| Method | Space | Params | GFLOPs | gFID (↓) | IS (↑) | Recall (↑) |
|---|---|---|---|---|---|---|
| DiT-XL | Latent | 675M | 238 | 2.27 | 278.2 | 0.57 |
| SiT-XL | Latent | 675M | 238 | 2.06 | 270.3 | 0.59 |
| REPA (SiT-XL) | Latent | 675M | 238 | 1.42 | 305.7 | 0.65 |
| ADM-U | Pixel | 554M | 2240 | 4.59 | 186.7 | 0.52 |
| PixelFlow-XL | Pixel | 677M | 5818 | 1.98 | 282.1 | 0.60 |
| PixNerd-XL | Pixel | 700M | 268 | 1.93 | 298.0 | 0.60 |
| JiT-G | Pixel | 2B | 766 | 1.82 | 292.6 | 0.62 |
| PixelDiT-XL (Ours) | Pixel | 797M | 311 | 1.61 | 292.7 | 0.64 |
Comparison of class-conditioned generation on ImageNet 256×256. PixelDiT outperforms prior pixel-space models and closes the gap with latent models.
State-of-the-art on ImageNet 512×512
| Method | Space | Params | gFID (↓) | sFID (↓) | IS (↑) | Recall (↑) |
|---|---|---|---|---|---|---|
| DiT-XL | Latent | 675M | 3.04 | 5.02 | 240.8 | 0.54 |
| SiT-XL | Latent | 675M | 2.62 | 4.18 | 252.2 | 0.57 |
| REPA (SiT-XL) | Latent | 675M | 2.08 | 4.19 | 274.6 | 0.58 |
| ADM | Pixel | 554M | 3.85 | 5.86 | 221.7 | 0.53 |
| PixNerd-XL | Pixel | 700M | 2.84 | 5.95 | 245.6 | 0.59 |
| EPG | Pixel | 583M | 2.35 | — | 295.4 | 0.57 |
| JiT-H | Pixel | 956M | 1.94 | — | 309.1 | — |
| PixelDiT-XL (Ours) | Pixel | 797M | 1.80 | 5.53 | 279.4 | 0.66 |
Comparison of class-conditioned generation on ImageNet 512×512.
Competitive Text-to-Image Generation
512×512 Resolution
| Method | Space | GenEval (↑) | DPG (↑) |
|---|---|---|---|
| PixArt-α | Latent | 0.48 | 71.6 |
| PixArt-Σ | Latent | 0.52 | 79.5 |
| PixelFlow | Pixel | 0.60 | 77.9 |
| PixNerd | Pixel | 0.73 | 80.9 |
| PixelDiT-T2I | Pixel | 0.78 | 83.7 |
1024×1024 Resolution
| Method | Space | GenEval (↑) | DPG (↑) |
|---|---|---|---|
| PixArt-Σ | Latent | 0.54 | 80.5 |
| SDXL | Latent | 0.55 | 74.7 |
| DALLE 3 | Latent | 0.67 | 83.5 |
| FLUX-dev | Latent | 0.67 | 84.0 |
| PixelDiT-T2I | Pixel | 0.74 | 83.5 |
PixelDiT Podcast Deep Dive
A long-form conversation covering the motivation behind PixelDiT, architectural choices, and how pixel-space diffusion stacks up against latent pipelines.
Provided by AI Papers Slop (YouTube)
Gallery
Curated glimpses of PixelDiT across panoramas, portraits, still lifes, fashion stories, and ImageNet samples, each with its original prompt and full-resolution render.
BibTeX
@article{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yu, Yongsheng and Xiong, Wei and Nie, Weili and Sheng, Yichen and Liu, Shiqiu and Luo, Jiebo},
  journal={arXiv preprint arXiv:2511.20645},
  year={2025}
}