TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models
Abstract
Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the diffusion backward process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks: the “edit-friendly” DDPM-noise inversion approach. We analyze its application to fast-sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between the inverted noises and the expected noise schedule, and suggest a shifted noise schedule that corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.
What is this about?
We trace the artifacts to misaligned noise statistics and propose a time-shifting method to correct them. To improve editing strength, we analyze the Edit-Friendly equations and show that they can be broken into two components: one responsible for shifting the image between prompts, and one for shifting it between diffusion trajectories. We rescale the cross-prompt term and show that this increases editability without introducing new artifacts. For additional details, please see the paper.
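The split into a cross-prompt and a cross-trajectory component can be illustrated with the standard edit-friendly editing step. The notation here is a hypothetical choice for illustration (posterior-mean prediction μ_t(x, c), per-step noise scale σ_t, extracted noise map z_t, prompts c_src and c_tgt), and the paper's exact formulation may differ. Inversion defines z_t from the source trajectory, editing reuses it, and adding and subtracting μ_t(x^edit_t, c_src) separates the two effects:

```latex
\begin{aligned}
x^{\mathrm{src}}_{t-1} &= \mu_t(x^{\mathrm{src}}_t, c_{\mathrm{src}}) + \sigma_t z_t
  && \text{(inversion defines } z_t\text{)} \\
x^{\mathrm{edit}}_{t-1} &= \mu_t(x^{\mathrm{edit}}_t, c_{\mathrm{tgt}}) + \sigma_t z_t
  && \text{(editing reuses } z_t\text{)} \\
&= x^{\mathrm{src}}_{t-1}
  + \underbrace{\big[\mu_t(x^{\mathrm{edit}}_t, c_{\mathrm{tgt}}) - \mu_t(x^{\mathrm{edit}}_t, c_{\mathrm{src}})\big]}_{\text{cross-prompt}}
  + \underbrace{\big[\mu_t(x^{\mathrm{edit}}_t, c_{\mathrm{src}}) - \mu_t(x^{\mathrm{src}}_t, c_{\mathrm{src}})\big]}_{\text{cross-trajectory}}
\end{aligned}
```

The first bracket compares two prompts at the same state, and the second compares two states under the same prompt. Multiplying only the first bracket by a factor w > 1 is one way to realize the rescaling of the cross-prompt term; w = 1 recovers the plain update.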
Fixing Visual Artifacts
We observe that the misaligned statistics are approximately time-shifted: the noise statistics match the expected values roughly 200 steps earlier. Hence, we simply provide both the scheduler and the model with a timestep parameter that is also shifted by 200 steps, eliminating the domain gap.
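The diagnostic and the fix can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code. The sketch assumes a Stable-Diffusion-style "scaled linear" beta schedule with 1000 training timesteps. It measures noise statistics by the standard deviation sqrt(1 - alpha_bar_t) of the noise component. It also reads "200 steps earlier" as a smaller timestep index (t - 200). The helpers `matching_timestep` and `shifted_timestep` are hypothetical names; if the shift runs the other way, pass a negative `shift`.

```python
import numpy as np

# Assumption: SD-style "scaled linear" betas over 1000 training timesteps.
NUM_TRAIN_TIMESTEPS = 1000
BETAS = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, NUM_TRAIN_TIMESTEPS) ** 2
ALPHAS_CUMPROD = np.cumprod(1.0 - BETAS)
# Expected std of the noise component in x_t = sqrt(a_t) x_0 + sqrt(1 - a_t) eps.
EXPECTED_STD = np.sqrt(1.0 - ALPHAS_CUMPROD)


def matching_timestep(observed_std: float) -> int:
    """Hypothetical diagnostic: the timestep whose expected noise std is
    closest to a noise std measured on inverted latents."""
    return int(np.argmin(np.abs(EXPECTED_STD - observed_std)))


def shifted_timestep(t: int, shift: int = 200) -> int:
    """Timestep handed to BOTH the scheduler and the model.

    Assumption: "200 steps earlier" means a smaller index (t - shift);
    use a negative shift for the opposite convention. The result is
    clamped to the valid range [0, NUM_TRAIN_TIMESTEPS - 1].
    """
    return int(np.clip(t - shift, 0, NUM_TRAIN_TIMESTEPS - 1))
```

In use, comparing `matching_timestep(measured_std)` against the nominal step t estimates the offset, and the correction then replaces every t in the sampling loop with `shifted_timestep(t)` for both the scheduler and the model, so the two stay consistent with each other.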
Pseudo-guidance
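A minimal NumPy sketch of one pseudo-guided editing step is shown below. It assumes the standard edit-friendly step x_edit_{t-1} = mu(x_edit_t, c_tgt) + sigma_t z_t, with z_t = (x_src_{t-1} - mu(x_src_t, c_src)) / sigma_t taken from the inversion. It rewrites that step as the source latent plus a cross-prompt term and a cross-trajectory term, and scales only the cross-prompt term by `w`. The function name, the scale `w`, and the toy predictor `toy_mu`/`PROMPT_BIAS` are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np


def pseudo_guided_step(x_src_prev, x_src, x_edit, mu, src_prompt, tgt_prompt, w=1.0):
    """Edit-friendly editing step with the cross-prompt term scaled by w.

    x_src_prev: source-trajectory latent at t-1 (from the inversion)
    x_src, x_edit: source and edited latents at t
    mu(x, prompt): posterior-mean prediction for this step (hypothetical signature)
    """
    # Same state, different prompts: the prompt-induced change.
    cross_prompt = mu(x_edit, tgt_prompt) - mu(x_edit, src_prompt)
    # Same prompt, different states: the shift between the two trajectories.
    cross_traj = mu(x_edit, src_prompt) - mu(x_src, src_prompt)
    return x_src_prev + w * cross_prompt + cross_traj


# Hypothetical posterior-mean stand-in, used only to make the sketch runnable.
PROMPT_BIAS = {"a cat": np.array([0.0, 0.0]), "a dog": np.array([0.5, -0.5])}


def toy_mu(x, prompt):
    return 0.9 * x + PROMPT_BIAS[prompt]
```

With w = 1 the step equals the plain edit-friendly update, with w = 0 it edits under the source prompt, and w > 1 pushes further along the prompt-induced direction. Structurally this resembles classifier-free guidance, which also scales a difference between two predictions made at the same state under different conditions.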
Edit-Friendly and Delta Denoising Score Equivalence
Our investigation into the Edit-Friendly DDPM process reveals that it shares a similar form with the corrections employed by Delta Denoising Score (DDS). Surprisingly, we prove that under an appropriate choice of learning rates and timestep sampling, the two methods are functionally equivalent and produce exactly the same results. This finding also extends to the recent Posterior Distillation Sampling (PDS) method when applied to image editing.
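For comparison with the edit-friendly update, a single DDS step can be sketched as follows. This follows the commonly described form of DDS: noise both the edited and the source latents with a shared noise sample and timestep, then step along the difference of the two noise predictions. Classifier-free guidance, timestep weighting, and the specific learning-rate and timestep-sampling choices under which the equivalence holds are omitted. `dds_step`, `toy_eps`, and `PROMPT_BIAS` are hypothetical names, and the toy predictor exists only to make the example runnable.

```python
import numpy as np


def dds_step(z_tgt, z_src, eps_model, src_prompt, tgt_prompt, alpha_bar_t, t, lr, rng):
    """One Delta Denoising Score update on the edited latent z_tgt (sketch)."""
    noise = rng.standard_normal(z_tgt.shape)
    a, s = np.sqrt(alpha_bar_t), np.sqrt(1.0 - alpha_bar_t)
    # Both branches are noised with the SAME noise sample and timestep.
    zt_tgt = a * z_tgt + s * noise
    zt_src = a * z_src + s * noise
    # DDS direction: target-branch prediction minus source-branch prediction.
    grad = eps_model(zt_tgt, tgt_prompt, t) - eps_model(zt_src, src_prompt, t)
    return z_tgt - lr * grad


# Hypothetical linear noise predictor, used only to exercise the code.
PROMPT_BIAS = {"a cat": np.array([0.0, 0.0]), "a dog": np.array([0.5, -0.5])}


def toy_eps(x, prompt, t):
    return 0.5 * x + PROMPT_BIAS[prompt]
```

Like the edit-friendly step, the update is driven by the difference between a prediction on (edited state, target prompt) and one on (source state, source prompt) under shared noise; this is the similarity of form noted above. The sketch does not reproduce the equivalence proof.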
Results
Comparisons to Prior Work (Multi-step)
Comparisons to Prior Work (Few-step)
More results
BibTeX
If you find our work useful, please cite our paper:
@misc{deutch2024turboedittextbasedimageediting,
title={TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models},
author={Gilad Deutch and Rinon Gal and Daniel Garibi and Or Patashnik and Daniel Cohen-Or},
year={2024},
eprint={2408.00735},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.00735},
}