Diffusion Models as Masked Autoencoders
DiffMAE
It has been a longstanding belief that generation can facilitate a true understanding of visual data.
In line with this, we revisit generatively pre-training visual representations in light of denoising diffusion models, and build a connection between diffusion models and masked autoencoders.
In particular, we condition diffusion models on masked input and formulate them as masked autoencoders (DiffMAE), as sketched below. Our approach can:
- Serve as a strong initialization for downstream recognition tasks;
- Conduct generative image inpainting;
- Be effortlessly extended to video.
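To make the formulation concrete, here is a minimal PyTorch-style sketch of the core idea: an encoder sees only the visible patches, and a diffusion decoder denoises a noised version of the masked patches conditioned on the encoder output. The `encoder`, `decoder`, cosine noise schedule, and pixel-regression loss here are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def diffmae_training_step(encoder, decoder, patches, mask_ratio=0.75, num_timesteps=1000):
    """One simplified DiffMAE-style training step on (B, N, D) patch tokens."""
    B, N, D = patches.shape
    num_visible = int(N * (1 - mask_ratio))

    # Randomly split patch tokens into visible and masked sets.
    ids_shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
    ids_visible = ids_shuffle[:, :num_visible]
    ids_masked = ids_shuffle[:, num_visible:]
    visible = torch.gather(patches, 1, ids_visible.unsqueeze(-1).expand(-1, -1, D))
    masked = torch.gather(patches, 1, ids_masked.unsqueeze(-1).expand(-1, -1, D))

    # Forward diffusion: corrupt only the masked patches with Gaussian noise.
    t = torch.randint(0, num_timesteps, (B,), device=patches.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_timesteps) ** 2  # cosine schedule (assumption)
    eps = torch.randn_like(masked)
    noisy_masked = (alpha_bar.sqrt().view(B, 1, 1) * masked
                    + (1.0 - alpha_bar).sqrt().view(B, 1, 1) * eps)

    # The encoder sees only the visible patches; the decoder predicts the
    # clean masked patches from their noisy version plus this context.
    context = encoder(visible)
    pred = decoder(noisy_masked, t, context)

    # Mean-squared error on the masked patches only.
    return ((pred - masked) ** 2).mean()
```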
Random Mask Inpainting
The images are from the ImageNet-1K validation set.
[Image gallery: masked inputs, ground-truth images, and DiffMAE generations at different numbers of inference steps.]
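The inference-step comparison above reflects the iterative nature of diffusion sampling. The sketch below shows one way such inpainting could be run with a variable number of denoising steps; `decoder`, `context`, and the simplified re-noising rule are assumptions for illustration, not the paper's exact sampler.

```python
import torch


@torch.no_grad()
def inpaint_masked_patches(decoder, context, masked_shape, num_steps=100, num_train_steps=1000):
    """Iteratively denoise the masked patches, conditioned on the visible ones.

    `decoder` and `context` stand in for a trained DiffMAE decoder and the
    encoder output of the visible patches; fewer steps trades quality for speed.
    """
    x = torch.randn(masked_shape)  # start the masked region from pure noise
    timesteps = torch.linspace(num_train_steps - 1, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        t_batch = t.expand(masked_shape[0])
        x0_pred = decoder(x, t_batch, context)  # predict clean masked patches
        if i + 1 < len(timesteps):
            # Re-noise the prediction down to the next (smaller) timestep.
            t_next = timesteps[i + 1].float()
            alpha_bar = torch.cos(0.5 * torch.pi * t_next / num_train_steps) ** 2
            x = alpha_bar.sqrt() * x0_pred + (1.0 - alpha_bar).sqrt() * torch.randn_like(x0_pred)
        else:
            x = x0_pred
    return x  # generated content for the masked patches
```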
Center Mask Inpainting
The images are from the ImageNet-1K validation set.
[Image gallery: ground-truth images and DiffMAE center-mask inpainting results.]
Video Inpainting
The videos are from the Kinetics-400 validation set.
[Video gallery: ground-truth videos, masked inputs, and DiffMAE reconstructions.]
Fine-Tuning Generative Models
Besides generatively inpainting images, DiffMAE is a strong self-supervised pre-training approach. Its fine-tuning performance is:
- Comparable to leading self-supervised algorithms that focus solely on recognition;
- Stronger than other generative pre-training algorithms by a large margin.
| pre-training | architecture | params (M) | fine-tuned top-1 (%) |
|---|---|---|---|
| MAE | ViT-L | 304 | 85.9 |
| iGPT | iGPT-L | 1362 | 72.6 |
| ADM | U-Net | 211 | 83.3 |
| DDPM | ViT-L | 304 | 83.4 |
| DiffMAE | ViT-L | 304 | 85.8 |
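For fine-tuning, the pre-trained encoder is repurposed for recognition while the diffusion decoder is discarded. The wrapper below is a rough, hypothetical sketch of this setup; the class name, average pooling, and `embed_dim=1024` (ViT-L) are assumptions for illustration.

```python
import torch
import torch.nn as nn


class DiffMAEFineTuner(nn.Module):
    """Hypothetical wrapper: pre-trained DiffMAE encoder + linear classification head."""

    def __init__(self, pretrained_encoder, embed_dim=1024, num_classes=1000):
        super().__init__()
        self.encoder = pretrained_encoder      # e.g. a ViT-L encoder from DiffMAE pre-training
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):
        # During fine-tuning the encoder sees the full, unmasked image.
        features = self.encoder(patch_tokens)  # (B, N, embed_dim)
        pooled = features.mean(dim=1)          # global average pooling over patch tokens
        return self.head(pooled)
```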
BibTeX
@inproceedings{wei2023diffusion,
  author    = {Wei, Chen and Mangalam, Karttikeya and Huang, Po-Yao and Li, Yanghao and Fan, Haoqi and Xu, Hu and Wang, Huiyu and Xie, Cihang and Yuille, Alan and Feichtenhofer, Christoph},
  title     = {Diffusion Models as Masked Autoencoders},
  booktitle = {ICCV},
  year      = {2023},
}