Generative Powers of Ten
Xiaojuan Wang1
Janne Kontkanen3
Brian Curless1
Steve Seitz1
Ira Kemelmacher1
Ben Mildenhall3 Pratul Srinivasan3 Dor Verbin3 Aleksander Holynski2, 3
1University of Washington
2UC Berkeley
3Google Research
CVPR 2024 (Highlight)
Abstract
We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. This representation allows us to render continuously zooming videos, or explore different scales of the scene interactively. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.
Method
Our method uses a pre-trained diffusion model to jointly denoise multiple images of the scene at various scales. Noisy images from each zoom level, along with their respective prompts, are fed simultaneously into the same pretrained diffusion model, which returns estimates of the corresponding clean images. These estimates may disagree in the overlapping regions that multiple levels observe. We employ multi-resolution blending to fuse these regions into a consistent zoom stack, then re-render each zoom level from this consistent representation. The re-rendered images are used as the clean-image estimates in the DDPM sampling step.
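The structure of one joint sampling step can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `denoise` stands in for the pretrained diffusion model, the zoom factor between adjacent levels is assumed to be 2, and the paper's multi-resolution (Laplacian-style) blending is simplified here to re-rendering each level by compositing average-pooled deeper levels onto its center, which already makes the re-rendered levels exactly consistent.

```python
import numpy as np

ZOOM = 2  # assumed relative zoom factor between adjacent levels


def downsample(img, f):
    """Average-pool a square image by an integer factor f."""
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))


def center_slice(size, inner):
    """Slice selecting the centered inner x inner region of a size x size image."""
    off = (size - inner) // 2
    return slice(off, off + inner)


def render_level(stack, i):
    """Render zoom level i from the zoom stack by compositing downsampled
    versions of all deeper (more zoomed-in) levels onto its center."""
    out = stack[i].copy()
    h = out.shape[0]
    for j in range(i + 1, len(stack)):
        small = downsample(stack[j], ZOOM ** (j - i))
        s = center_slice(h, small.shape[0])
        out[s, s] = small
    return out


def joint_sampling_step(noisy, denoise, t):
    """One joint multi-scale step: denoise every level independently with the
    (placeholder) diffusion model, then replace each clean-image estimate with
    a consistent re-rendering from the fused zoom stack."""
    estimates = [denoise(x, t) for x in noisy]  # per-level x0 estimates
    return [render_level(estimates, i) for i in range(len(estimates))]
```

After this step, the center crop of each re-rendered level agrees with the downsampled version of the next deeper level, which is the consistency property the sampling loop enforces at every timestep.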
More Results
Zooming into a real image
We can guide one zoom level to match an input image, allowing us to zoom into a real image.
Diversity
By varying the seed, we can get different results for the same set of input prompts.
Baseline Comparisons
Another way to generate a zooming video is to either (1) progressively super-resolve a zoomed-out image with a text-conditioned super-resolution model, or (2) progressively outpaint a zoomed-in image with a text-conditioned outpainting model. Here we compare against these two variants, using Stable Diffusion's super-resolution and outpainting models. We observe that such causal generation typically yields inferior results, since earlier generations are not always compatible with subsequent zoom levels.
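The causal structure of the outpainting baseline can be sketched as follows. This is a hypothetical illustration, not the actual Stable Diffusion pipeline: `outpaint` is a placeholder for a text-conditioned outpainting model, and the zoom factor of 2 between levels is an assumption. The key point is that each level is frozen before the next is generated, so earlier outputs cannot adapt to later ones, in contrast to our joint sampling.

```python
import numpy as np

def place_in_center(img, zoom=2):
    """Average-pool img by `zoom` and paste it into the center of a blank
    canvas of the original size; the zero border is left for outpainting."""
    h, w = img.shape
    small = img.reshape(h // zoom, zoom, w // zoom, zoom).mean(axis=(1, 3))
    canvas = np.zeros_like(img)
    top, left = (h - small.shape[0]) // 2, (w - small.shape[1]) // 2
    canvas[top:top + small.shape[0], left:left + small.shape[1]] = small
    return canvas


def progressive_outpaint(innermost, outpaint, prompts):
    """Causal baseline: start from the most zoomed-in image and repeatedly
    outpaint outward, one zoom level per prompt. Each level is fixed before
    the next is generated."""
    levels = [innermost]
    for prompt in prompts:
        canvas = place_in_center(levels[-1])
        levels.append(outpaint(canvas, prompt))
    return levels[::-1]  # widest view first
```

The progressive super-resolution baseline has the mirrored structure: it starts from the widest view and repeatedly upsamples the center crop, with the same one-directional dependency.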
| Stable Diffusion Super-Resolution | Stable Diffusion Outpainting | Ours |
Acknowledgements
This research project is inspired by the original 1977 Powers of Ten film, which first showcased this type of continuous zoom effect. Our goal in this project is to create a similar animation automatically with a generative model, and also to enable the creation of these zoom videos from our own photos. We would also like to thank Ben Poole, Jon Barron, Luyang Zhu, Ruiqi Gao, Tong He, Grace Luo, Angjoo Kanazawa, Vickie Ye, Songwei Ge, Keunhong Park, and David Salesin for helpful discussions and feedback.
BibTeX
@article{wang2023generativepowers,
  title={Generative Powers of Ten},
  author={Xiaojuan Wang and Janne Kontkanen and Brian Curless and Steve Seitz and Ira Kemelmacher
          and Ben Mildenhall and Pratul Srinivasan and Dor Verbin and Aleksander Holynski},
  journal={arXiv preprint arXiv:2312.02149},
  year={2023}
}