MegaScenes: Scene-Level View Synthesis at Scale
Abstract
Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse) or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K SfM reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues and further curate a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments we validate the effectiveness of both our dataset and method on generating in-the-wild scenes.
Dataset Collection
We first source and identify potential scene categories from WikiData. Next, we download images and metadata for each scene category. Finally, we reconstruct scenes using Structure from Motion (SfM) and clean them using the Doppelgangers pipeline.
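As a rough sketch of the reconstruction step, the pycolmap bindings to COLMAP can run the SfM stages end to end. The paths below are placeholders, and the Doppelgangers-based cleaning is a separate pipeline not shown here.

```python
# Minimal SfM sketch using pycolmap (Python bindings to COLMAP).
# Paths are placeholders; the actual MegaScenes pipeline also cleans
# reconstructions with Doppelgangers, which is not shown here.
import pycolmap

database_path = "scene/database.db"  # placeholder
image_dir = "scene/images"           # placeholder
output_dir = "scene/sparse"          # placeholder

pycolmap.extract_features(database_path, image_dir)  # SIFT feature extraction
pycolmap.match_exhaustive(database_path)             # pairwise feature matching
maps = pycolmap.incremental_mapping(database_path, image_dir, output_dir)
for idx, reconstruction in maps.items():
    print(idx, reconstruction.summary())
```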
Dataset Statistics
We show the distribution of the MegaScenes Dataset. On the left, we depict the frequency of scenes grouped by WikiData class. This includes only select classes with more than 3,500 scenes; note that a single scene may be an instance of multiple classes. On the right, we visualize the geospatial distribution of collected scenes worldwide.
Application: Single Image Novel View Synthesis
To explore the diversity and scale of the MegaScenes Dataset, we experiment on the task of single image novel view synthesis, where the goal is to take a reference image and generate a plausible image at a target pose. We train and evaluate on image pairs with pseudo-ground-truth relative poses obtained via SfM.
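Concretely, if SfM provides world-to-camera extrinsics [R | t] for the reference and target images, the pseudo-ground-truth relative pose can be composed as in the following minimal numpy sketch (function and variable names are ours, for illustration):

```python
import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Relative transform mapping points from camera-1 coordinates to
    camera-2 coordinates, given world-to-camera extrinsics [R1|t1]
    and [R2|t2]: x_c2 = R_rel @ x_c1 + t_rel."""
    R_rel = R2 @ R1.T
    t_rel = t2 - R_rel @ t1
    return R_rel, t_rel
```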
Conditioning on the Extrinsic Matrix
Simply finetuning pose-conditioned diffusion models, such as ZeroNVS, significantly improves their generalization to in-the-wild scenes. However, the depth and scale of the scene in ZeroNVS are ambiguous and require manual tuning.
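As a purely hypothetical illustration (not ZeroNVS's exact parameterization), conditioning on the extrinsic matrix amounts to feeding the network something like a flattened relative pose plus a scene-scale scalar; that scalar is precisely the quantity that is ambiguous and must be tuned:

```python
import numpy as np

def pose_conditioning(R_rel, t_rel, scene_scale):
    """Hypothetical conditioning vector: a flattened 3x4 relative
    extrinsic plus a scene-scale scalar. Because SfM reconstructions
    have arbitrary scale, scene_scale has no canonical value and must
    be chosen (or manually tuned) per scene."""
    extrinsic = np.concatenate([R_rel.reshape(-1), t_rel.reshape(-1)])
    return np.concatenate([extrinsic, [scene_scale]])
```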
Conditioning on Warped Images
We find that first warping the reference image into the target pose provides a strong conditioning signal: it encodes how pixels should move and is directly aligned with the scene scale. On our training and evaluation datasets, the scale is derived from 3D SfM points. Given a random, in-the-wild image, we can instead determine the scene scale from estimated monocular depth, and use the same extrinsics for both conditioning and warping so the scale stays consistent.
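The warp itself can be sketched as unproject-transform-reproject: lift each reference pixel to 3D using a depth map, apply the relative extrinsics, and splat into the target view. Below is a minimal numpy sketch assuming known intrinsics K and a depth map (from SfM points or monocular depth); it omits z-buffering and hole handling for brevity:

```python
import numpy as np

def forward_warp(image, depth, K, R_rel, t_rel):
    """Naively splat reference pixels into the target view.
    image: (H, W, 3), depth: (H, W), K: (3, 3) intrinsics,
    [R_rel | t_rel]: relative transform from the reference
    camera to the target camera (no z-buffering)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Unproject to reference-camera 3D points, then move to the target camera.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts = R_rel @ pts + t_rel.reshape(3, 1)
    # Reproject; keep only points in front of the target camera.
    proj = K @ pts
    z = proj[2]
    valid = z > 1e-6
    x = np.round(proj[0, valid] / z[valid]).astype(int)
    y = np.round(proj[1, valid] / z[valid]).astype(int)
    inside = (x >= 0) & (x < W) & (y >= 0) & (y < H)
    warped = np.zeros_like(image)
    warped[y[inside], x[inside]] = image.reshape(-1, 3)[valid][inside]
    return warped
```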
Evaluation
We evaluate on MegaScenes' test set, which consists of in-the-wild scenes from Internet photos. Here, we show comparisons between four models:
1. SD-inpainting: a Stable Diffusion inpainting model without any finetuning.
2. ZeroNVS (released): the released ZeroNVS checkpoint.
3. ZeroNVS (MS): ZeroNVS finetuned on MegaScenes.
4. Ours: finetuned from ZeroNVS on MegaScenes, and conditioned on both the extrinsic matrices and the warped images.
See the paper for more evaluations and baselines.
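For reference, such comparisons are typically scored with standard image metrics like PSNR and LPIPS; the snippet below is an illustrative torchmetrics sketch, not the paper's exact evaluation protocol:

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

psnr = PeakSignalNoiseRatio(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg")

pred = torch.rand(1, 3, 256, 256)    # placeholder generated image in [0, 1]
target = torch.rand(1, 3, 256, 256)  # placeholder ground-truth image

print("PSNR:", psnr(pred, target).item())
# LPIPS expects inputs scaled to [-1, 1].
print("LPIPS:", lpips(pred * 2 - 1, target * 2 - 1).item())
```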
Discussion
MegaScenes is a general large-scale 3D dataset, and we foresee a variety of 3D-related applications that could benefit from MegaScenes, such as pose estimation, feature matching, and reconstruction. In this paper we focus on NVS as a representative application and we find that MegaScenes is indeed capable of training generalizable 3D models.
Acknowledgments
We thank Brandon Li for building the COLMAP webviewer. This work was funded in part by the National Science Foundation (IIS-2008313, IIS-2211259, IIS-2212084). Gene Chou was funded by an NSF Graduate Research Fellowship.
BibTeX
@inproceedings{tung2024megascenes,
  title={MegaScenes: Scene-Level View Synthesis at Scale},
  author={Tung, Joseph and Chou, Gene and Cai, Ruojin and Yang, Guandao and Zhang, Kai and Wetzstein, Gordon and Hariharan, Bharath and Snavely, Noah},
  booktitle={ECCV},
  year={2024}
}