While SDS tries to model this optimal path, it is only a per-iteration approximation of it, and this approximation ultimately causes its characteristic artifacts.
First-Order Approximation Error
Instead of training a model to solve this problem directly, we use pre-trained diffusion models to approximate the bridge. This requires solving two PF-ODEs, which take dozens of function evaluations (NFEs), to estimate the gradient at each iteration. SDS instead uses a single-step estimate, which is more practical but less accurate. Recent works such as ISM [2] and SDI [3] can be interpreted as reducing this error with a multi-step simulation.
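To make the gap concrete, here is a minimal sketch (our illustration, not code from any of these papers) contrasting the single forward-noising step used by SDS with a multi-step, DDIM-style inversion in the spirit of ISM/SDI. The names `unet`, `emb`, and `alphas_cumprod` are placeholders for a pre-trained noise predictor, a text embedding, and the cumulative DDPM noise schedule.

```python
import torch

def noisy_latent_single_step(x0, t, alphas_cumprod):
    """SDS-style: jump straight to x_t with one draw of Gaussian noise (cheap but noisy)."""
    eps = torch.randn_like(x0)
    a_t = alphas_cumprod[t]
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * eps
    return x_t, eps

@torch.no_grad()
def noisy_latent_multi_step(x0, t, alphas_cumprod, unet, emb, n_steps=10):
    """ISM/SDI-flavored: walk from the clean sample up to timestep t with
    deterministic DDIM inversion, spending n_steps extra NFEs so that x_t
    better respects the model's own PF-ODE."""
    ts = torch.linspace(0, int(t), n_steps + 1).long()
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps_hat = unet(x, t_cur, emb)                          # one NFE per step
        x0_hat = (x - (1.0 - a_cur).sqrt() * eps_hat) / a_cur.sqrt()
        x = a_next.sqrt() * x0_hat + (1.0 - a_next).sqrt() * eps_hat
    return x
```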
Source Distribution Mismatch Error
Estimating the Schrödinger Bridge relies on \(\epsilon_{\phi, \text{src}}\) accurately estimating the distribution of the current sample, \(x_{\theta}\). SDS (under high CFG) uses the unconditional distribution as a proxy for this current distribution, which contributes to its characteristic artifacts. A series of works can be viewed as reducing this error [4, 5, 6].
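In bridge form, the gradient is a difference of two noise predictions at the same noisy sample: one for the target distribution and one for the source. The sketch below (a paraphrase under the same placeholder assumptions as above, with the timestep weighting folded into `w`) makes SDS's choice explicit: its source term is the unconditional prediction, no matter how corrupted the current render actually is.

```python
def bridge_grad(unet, x_t, t, emb_tgt, emb_src, w=1.0):
    """Score-distillation gradient direction as a difference of noise
    predictions: eps for the target (prompted) distribution minus eps for the
    source distribution that the current sample x_theta is assumed to follow."""
    eps_tgt = unet(x_t, t, emb_tgt)
    eps_src = unet(x_t, t, emb_src)
    return w * (eps_tgt - eps_src)

# SDS under high CFG: the source is modeled by the *unconditional* prediction,
# i.e. emb_src is the null/empty-prompt embedding.
# grad_sds = bridge_grad(unet, x_t, t, emb_text, emb_null, w=cfg_scale)
```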
A Unified Framework
We show that this unified framework applies to a number of different methods: it explains VSD's high-quality results as well as the shortcomings of other, more efficient methods. Check out our paper for a more thorough analysis.
A Fast but Effective Alternative
We know that pre-trained diffusion models understand the distributions of both high-quality and corrupted images, as well as their correspondence with natural language. So, by simply describing image corruptions with a text prompt, we can try to better model our original source distribution.
Rather than approximating the current distribution with the unconditional distribution as in SDS, we use this negatively prompted conditional distribution to better model the types of artifacts our optimized image variables may exhibit. This simple change considerably improves results.
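A rough sketch of the resulting image-optimization loop, reusing the helpers above; this is our paraphrase rather than the released implementation, and `encode_prompt`, the default negative prompt, the timestep range, and the latent shape are illustrative placeholders.

```python
import torch

def optimize_image(unet, encode_prompt, alphas_cumprod, prompt,
                   neg_prompt="oversaturated, blurry, low quality",  # illustrative descriptors
                   n_iters=500, lr=1e-2, size=(1, 4, 64, 64), init=None):
    """Score distillation over image/latent variables with a negatively
    prompted source distribution in place of SDS's unconditional proxy."""
    emb_tgt = encode_prompt(prompt)
    emb_src = encode_prompt(neg_prompt)          # describes likely corruptions of x
    x = (init.detach().clone() if init is not None
         else torch.randn(size)).requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_iters):
        t = int(torch.randint(20, 980, ()))      # random diffusion timestep
        x_t, _ = noisy_latent_single_step(x.detach(), t, alphas_cumprod)
        with torch.no_grad():
            g = bridge_grad(unet, x_t, t, emb_tgt, emb_src)
        loss = (g * x).sum()                     # surrogate: d(loss)/dx == g
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```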
Text-based Image Optimization
[Text-based image optimization comparison across methods, reporting COCO-FID and wall-clock time: 86.02 / 4.48 min, 91.70 / 7.20 min, 89.96 / 6.21 min, 59.22 / 16.02 min, 55.65 / 21.46 min, 67.89 / 4.48 min.]
Text-based NeRF Optimization
[Side-by-side text-based NeRF optimization results for VSD, SDS, and Ours.]
Painting-to-Real
We examine our method's ability to serve as a general-purpose realism prior. An effective image prior should guide a painting toward a nearby natural image through optimization. We simply append the negative descriptor "painting" to our gradient's source prompt and initialize the optimization from the painting. Slide your mouse across to see the difference!
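Under the same placeholder assumptions, painting-to-real amounts to a single call to the loop sketched earlier, with the painting's encoded latent as the initialization and "painting" as the negative descriptor (the target prompt below is only a guess at how one might describe the scene).

```python
# Painting-to-real with the loop sketched above: initialize from the painting's
# encoded latent and use the single negative descriptor "painting" as the source
# prompt. Both prompts are illustrative, not the paper's exact settings.
real = optimize_image(unet, encode_prompt, alphas_cumprod,
                      prompt="a photograph of a mountain lake at sunset",
                      neg_prompt="painting",
                      init=painting_latent)
```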
Acknowledgment
We thank Matthew Tancik, Jiaming Song, Riley Peterlinz, Ayaan Haque, Ethan Weber, Konpat Preechakul, Ruiqi Gao, Amit Kohli and Ben Poole for their helpful feedback and discussion.
This project is supported in part by a Google Research Scholar award and IARPA DOI/IBC No. 140D0423C0035. The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements of these institutions.
BibTeX
@inproceedings{mcallister2024rethinking,
title={Rethinking Score Distillation as a Bridge Between Image Distributions},
author={David McAllister and Songwei Ge and Jia-Bin Huang and David W. Jacobs and Alexei A. Efros and Aleksander Holynski and Angjoo Kanazawa},
booktitle={Advances in Neural Information Processing Systems},
year={2024}
}