RealmDreamer: Text-Driven 3D Scene Generation
with Inpainting and Depth Diffusion
We generate large, explorable 3D scenes from a text description using only pretrained 2D diffusion models.
Abstract
We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our method optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by using state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model conditioned on samples from the inpainting model, which provides rich geometric cues. Finally, we fine-tune the model using sharpened samples from image generators. Notably, our technique requires no training on scene-specific datasets and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
Results
A bear sitting in a classroom with a hat on, realistic, 4k image, high detail
Inpainting priors are great for occlusion reasoning
Using text-conditioned 2D diffusion models for 3D scene generation is tricky because different samples are not 3D consistent. We mitigate this by leveraging 2D inpainting priors as novel-view estimators instead. By rendering an incomplete 3D model and inpainting its unknown regions, we learn to generate consistent 3D scenes.
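As a rough sketch of this idea (not the exact implementation), the incomplete model is rendered from a novel view, low-opacity pixels define the unknown region, and a 2D inpainting prior completes them. Here render_gaussians and inpainting_diffusion are hypothetical stand-ins for a 3DGS rasterizer and a pretrained image-conditional inpainting model.

import torch

def inpainting_target(splats, camera, prompt, render_gaussians, inpainting_diffusion,
                      opacity_threshold=0.9):
    """Sketch: use a 2D inpainting model as a novel-view estimator.

    `render_gaussians` and `inpainting_diffusion` are hypothetical callables
    standing in for a 3DGS renderer and an inpainting diffusion model.
    """
    # Render the (incomplete) 3D model from a novel camera.
    rgb, accumulated_opacity = render_gaussians(splats, camera)  # (3,H,W), (1,H,W)

    # Pixels with low accumulated opacity were never covered by the lifted
    # prototype -- these are the occluded/unknown regions to fill in.
    unknown_mask = (accumulated_opacity < opacity_threshold).float()

    # Ask the 2D inpainting prior to complete only the unknown regions,
    # conditioned on the text prompt and the known content.
    target = inpainting_diffusion(image=rgb, mask=unknown_mask, prompt=prompt)

    # The completed view serves as a per-view target for optimizing the splats.
    loss = torch.nn.functional.mse_loss(rgb, target.detach())
    return loss, target, unknown_mask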
Image to 3D
We show that our technique can generate 3D scenes from a single image. This is a challenging task as it requires the model to hallucinate the missing geometry and texture in the scene. We do not require training on any scene-specific dataset.
Input Image
How?
Step 1: Generate a Prototype
We start by generating a cheap 2D prototype of the 3D scene from the text description using a pretrained text-to-image generator. Given the resulting image, we lift its content into 3D with a monocular depth estimator and then compute the occlusion volume. This serves as the initialization for a 3D Gaussian Splatting (3DGS) representation.
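A minimal sketch of the lifting step, assuming a pinhole camera; depth_model below is a hypothetical monocular depth estimator, and the full prototype construction (including the occlusion volume) is more involved than this.

import torch

def lift_image_to_pointcloud(image, depth, fx, fy, cx, cy):
    """Unproject an RGB image into a 3D point cloud using a monocular depth map.

    `image`: (3, H, W) tensor in [0, 1]; `depth`: (H, W) depth map.
    fx, fy, cx, cy: pinhole intrinsics. The resulting points seed the 3D Gaussians.
    """
    _, H, W = image.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")

    # Back-project each pixel (u, v) with depth d to camera-space XYZ.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = torch.stack([x, y, z], dim=-1).reshape(-1, 3)   # (H*W, 3)
    colors = image.permute(1, 2, 0).reshape(-1, 3)           # (H*W, 3)
    return points, colors

# Usage sketch (depth_model is hypothetical):
# points, colors = lift_image_to_pointcloud(image, depth_model(image), fx, fy, cx, cy)
# Each point initializes one Gaussian; the occlusion volume (space hidden behind
# the lifted surface from the reference view) is additionally filled with Gaussians.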
Step 2: Inpaint Missing Regions
The initialized 3D scene is incomplete and contains missing regions. To fill them in, we leverage a 2D inpainting diffusion model and optimize the splats to match its output across multiple views. An additional depth-distillation loss on the sampled images ensures the inpainted regions are geometrically plausible.
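The sketch below outlines one simplified optimization step under these assumptions; render_gaussians, inpaint_model, and depth_diffusion are hypothetical stand-ins for the 3DGS renderer, the image-conditional inpainting model, and the depth diffusion model, and the exact loss weighting differs in practice.

import torch

def optimization_step(splats, camera, prompt, optimizer,
                      render_gaussians, inpaint_model, depth_diffusion,
                      lambda_depth=0.1):
    """One simplified training step: inpainting target plus depth distillation."""
    rgb, depth, opacity = render_gaussians(splats, camera)

    # The 2D inpainting prior fills in regions the prototype could not see.
    unknown_mask = (opacity < 0.9).float()
    inpainted = inpaint_model(image=rgb.detach(), mask=unknown_mask, prompt=prompt)

    # The depth diffusion model, conditioned on the inpainted sample, provides
    # a geometric target so the filled-in content sits at a plausible depth.
    target_depth = depth_diffusion(inpainted)

    loss_rgb = torch.nn.functional.l1_loss(rgb, inpainted.detach())
    loss_depth = torch.nn.functional.l1_loss(depth, target_depth.detach())
    loss = loss_rgb + lambda_depth * loss_depth

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()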
Step 3: Refine the Scene
Finally, we refine the 3D model to improve the cohesion between inpainted regions and the prototype by using a vanilla text-to-image diffusion model. An additional sharpness filter ensures the generated samples are more detailed.
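As an illustration of the sharpening step (one plausible realization, not necessarily the exact filter used), an unsharp mask can be applied to the text-to-image samples before they are used as refinement targets:

import torch
from torchvision.transforms.functional import gaussian_blur

def sharpen_sample(image, amount=1.0, kernel_size=5, sigma=1.0):
    """Unsharp masking: boost high-frequency detail in a sampled image.

    `image`: (3, H, W) tensor in [0, 1]. A plausible sharpness filter for
    the refinement stage, assumed here for illustration.
    """
    blurred = gaussian_blur(image, kernel_size=kernel_size, sigma=sigma)
    sharpened = image + amount * (image - blurred)   # add back high frequencies
    return sharpened.clamp(0.0, 1.0)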
Related Work
There are many related works that have influenced our technique:
- Dreamfusion, Score Jacobian Chaining, and ProlificDreamer pioneer the use of pretrained 2D diffusion models for 3D generation.
- Text2Room shows the effectiveness of iterative approaches for indoor scene synthesis.
Several concurrent works also tackle scene generation or use inpainting models for similar applications.
Citation
If you find our work interesting, please consider citing us!
@inproceedings{shriram2024realmdreamer,
  title     = {RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion},
  author    = {Jaidev Shriram and Alex Trevithick and Lingjie Liu and Ravi Ramamoorthi},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2025}
}
Acknowledgements
We thank Jiatao Gu and Kai-En Lin for early discussions, and Aleksander Holynski and Ben Poole for later discussions. We thank Michelle Chiu for help with video design. This work was supported in part by an NSF Graduate Fellowship, ONR grant N00014-23-1-2526, NSF CHASE-CI grants 2100237 and 2120019, gifts from Adobe, Google, Qualcomm, and Meta, the Ronald L. Graham Chair, and the UC San Diego Center for Visual Computing.
The website template builds on those of Reconfusion, Michaël Gharbi, and Ref-NeRF.