InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
International Conference on Learning Representations (ICLR) 2024
🌟 Spotlight 🌟
Abstract
Comprehending natural language instructions is an appealing property for 3D indoor scene synthesis systems. Existing methods (e.g., ATISS and DiffuScene) directly model the object distributions within a scene, which hinders the controllability of generation.
We introduce InstructScene, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 3D scene synthesis. The proposed semantic graph prior jointly learns indoor scene appearance and layout distributions, exhibiting versatility across various generative tasks. To facilitate the benchmarking for text-driven 3D scene synthesis, we curate a high-quality dataset of scene-instruction pairs with large language and multimodal models.
Extensive experimental results reveal that the proposed method surpasses existing state-of-the-art approaches by a large margin. Thorough ablation studies confirm the efficacy of crucial design components. Our code and dataset are available here.
Method
Scene-Instruction Pair Dataset
We construct a high-quality dataset of scene-instruction pairs based on 3D-FRONT, a professionally designed collection of synthetic indoor scenes. As it does not contain any descriptions of room layouts or object appearances, we (1) extract view-dependent spatial relations with predefined rules, and (2) caption objects appearing in the scenes with BLIP. To ensure the accuracy of descriptions, (3) the generated captions are refined by ChatGPT with ground-truth object categories. (4) The final instructions are derived from randomly selected relation triplets. For more details on the dataset, please refer to the appendix of our paper. The curated dataset is available here.
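As an illustrative sketch of step (4), the snippet below renders a few randomly selected relation triplets into a single instruction. The templates, relation names, and function are hypothetical and only convey the idea; they are not the exact rules used to build the dataset.

```python
import random

# Hypothetical templates for turning (subject, relation, object) triplets
# into instruction sentences; illustrative only, not the dataset's actual rules.
TEMPLATES = {
    "left of": "Add {subj} to the left of {obj}",
    "right of": "Add {subj} to the right of {obj}",
    "above": "Place {subj} above {obj}",
    "closely in front of": "Position {subj} closely in front of {obj}",
}

def compose_instruction(triplets, num_relations=2):
    """Pick a few relation triplets and join them into one instruction."""
    chosen = random.sample(triplets, k=min(num_relations, len(triplets)))
    sentences = [
        TEMPLATES.get(rel, "Add {subj} near {obj}").format(subj=subj, obj=obj)
        for subj, rel, obj in chosen
    ]
    return ". ".join(sentences) + "."

triplets = [
    ("a corner side table with a round top", "left of",
     "a black and silver pendant lamp with lights"),
    ("a grey dining chair", "right of", "a grey dining table with round top"),
]
print(compose_instruction(triplets))
```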
Semantic Graph Prior
- Feature Quantization: semantic features for 3D objects are extracted from a frozen multimodal-aligned point cloud encoder, i.e., OpenShape, and then quantized by codebook entries (see the quantization sketch after this list).
- Discrete Semantic Graph Diffusion: three categorical variables (object categories, spatial relations, and quantized features) are independently masked; empty states are not depicted here for concision. A graph Transformer with a frozen text encoder learns the semantic graph prior by iteratively recovering corrupted graphs (see the masking sketch after this list).
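Below is a minimal sketch of the feature-quantization step, assuming pre-extracted OpenShape features: each object feature is mapped to its nearest codebook entry. The tensor shapes, codebook size, and function name are illustrative assumptions, not the exact configuration of our implementation.

```python
import torch

def quantize_features(features: torch.Tensor, codebook: torch.Tensor):
    """Map each object feature to the index (token) of its nearest codebook entry.

    features: (num_objects, dim) outputs of the frozen point cloud encoder
    codebook: (num_codes, dim)   codebook entries
    """
    distances = torch.cdist(features, codebook)  # (num_objects, num_codes)
    indices = distances.argmin(dim=-1)           # discrete semantic-feature tokens
    quantized = codebook[indices]                # quantized feature vectors
    return indices, quantized

# Shapes below are illustrative; the actual feature dimension and codebook size
# follow the OpenShape encoder and the model configuration.
features = torch.randn(8, 1280)    # e.g. 8 objects in a scene
codebook = torch.randn(512, 1280)  # e.g. 512 codebook entries
indices, quantized = quantize_features(features, codebook)
```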
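The next sketch illustrates the forward (corruption) side of the discrete graph diffusion: object categories, spatial relations, and quantized feature tokens are each independently replaced by an absorbing mask state with a timestep-dependent probability. The linear masking schedule, mask indices, and variable names are assumptions for illustration only.

```python
import torch

def mask_tokens(tokens: torch.Tensor, t: int, num_steps: int, mask_id: int):
    """Independently replace tokens by the absorbing mask state with probability t / num_steps."""
    corrupt = torch.rand_like(tokens, dtype=torch.float) < t / num_steps
    return torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)

num_steps, t = 100, 60
categories = torch.randint(0, 20, (8,))     # object-category tokens
relations  = torch.randint(0, 10, (8, 8))   # pairwise spatial-relation tokens
feat_codes = torch.randint(0, 512, (8,))    # quantized semantic-feature tokens

# Corrupt each categorical variable independently; in this sketch the mask id
# is simply one past each variable's vocabulary size.
noisy_categories = mask_tokens(categories, t, num_steps, mask_id=20)
noisy_relations  = mask_tokens(relations,  t, num_steps, mask_id=10)
noisy_feat_codes = mask_tokens(feat_codes, t, num_steps, mask_id=512)
```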
Layout Decoder
Gaussian noise is sampled and attached to every node of the semantic graph. A graph Transformer iteratively processes these graphs to remove the noise and generate layout configurations, including positions (t), sizes (s), and orientations (r) of objects.
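A hedged sketch of the corresponding sampling loop is given below: layout attributes start from Gaussian noise at every node and are iteratively denoised by the graph Transformer conditioned on the semantic graph. The `layout_decoder` callable, the DDPM noise schedule, and the attribute parameterization are illustrative assumptions rather than the exact implementation.

```python
import torch

@torch.no_grad()
def sample_layout(layout_decoder, graph, num_objects, num_steps=100,
                  attr_dim=3 + 3 + 2):  # position (3) + size (3) + orientation as cos/sin (2)
    """DDPM-style ancestral sampling of layout attributes conditioned on a semantic graph."""
    x = torch.randn(num_objects, attr_dim)          # Gaussian noise attached to every node
    betas = torch.linspace(1e-4, 0.02, num_steps)   # illustrative linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for step in reversed(range(num_steps)):
        # The graph Transformer predicts the noise given the semantic graph condition.
        eps = layout_decoder(x, graph, timestep=step)
        coef = betas[step] / torch.sqrt(1.0 - alpha_bars[step])
        mean = (x - coef * eps) / torch.sqrt(alphas[step])
        noise = torch.randn_like(x) if step > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[step]) * noise
    return x  # (num_objects, attr_dim): t, s, r for each object
```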
Qualitative Results
We provide visualizations for our model and two baselines, ATISS and DiffuScene. All the synthesized scenes are rendered with Blender. The rendering script is available here.
Instruction-Driven Synthesis
| "Add a corner side table with a round top to the left of a black and silver pendant lamp with lights" | ||
| ATISS | DiffuScene | InstructScene (Ours) |
| "Place a black pendant lamp with hanging balls above a grey dining table with round top. Next, position a grey dining chair to the close right below of a black pendant lamp with hanging balls" | ||
| ATISS | DiffuScene | InstructScene (Ours) |
| "Set up a brass pendant lamp with lights above a dining table with a marble top" | ||
| ATISS | DiffuScene | InstructScene (Ours) |
Zero-shot Applications
Stylization
| "Make the room brown style" | |||
| Original Scene | ATISS | DiffuScene | InstructScene (Ours) |
| "Make objects in the room black" | |||
| Original Scene | ATISS | DiffuScene | InstructScene (Ours) |
| "Let the room be in gray style" | |||
| Original Scene | ATISS | DiffuScene | InstructScene (Ours) |
Re-arrangement
From left to right: (1) input instructions, (2) messy scenes, (3) ATISS, (4) DiffuScene, (5) InstructScene (Ours).
Completion
From left to right: (1) input instructions, (2) original scenes, (3) ATISS, (4) DiffuScene, (5) InstructScene (Ours).
Unconditional Generation
From left to right: (1) ATISS, (2) DiffuScene, (3) InstructScene (Ours).
InstructScene without Semantic Features
- Left three columns: (1) input instructions, (2) InstructScene without semantic features, (3) InstructScene (Ours).
- Right three columns: unconditional generation without semantic features.
A significant decline in appearance controllability and style consistency can be observed when semantic features are omitted. This arises because, without semantic features, the generative model focuses solely on modeling the distributions of layout attributes; the occurrences and combinations of generated objects therefore lack awareness of object style and appearance, which are crucial elements in scene design.
Diversity
- Left three columns: a diverse set of scenes generated from the same instructions.
- Right three columns: a diverse set of scenes generated from the same semantic graphs.
Quantitative Results
Instruction-Driven Synthesis
ATISS outperforms DiffuScene in terms of generation fidelity, owing to its capacity to model in discrete spaces. DiffuScene shows better controllability than ATISS because it affords global visibility of samples during generation. The proposed InstructScene exhibits the best of both worlds.
It is noteworthy that InstructScene excels in handling more complex scenes, such as living and dining rooms, revealing the benefits of modeling intricate 3D scenes associated with the semantic graph prior.
Zero-shot Applications
While ATISS, as an auto-regressive model, is a natural fit for the completion task, its unidirectional dependency chain limits its effectiveness for tasks requiring global scene modeling, such as re-arrangement. DiffuScene can adapt to these tasks by replacing the known parts with noised copies of the corresponding scene attributes during sampling, similar to image in-painting. However, the known attributes are heavily corrupted in the early steps, which can misguide the denoising direction and therefore necessitates fine-tuning. Additionally, it faces challenges in searching for semantic features in a continuous space for stylization. In contrast, the proposed InstructScene globally models scene attributes and treats partial scene attributes as intermediate discrete states during training.
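To make the in-painting analogy concrete, the sketch below shows one such sampling step in which the attributes of the known (fixed) objects are overwritten with a freshly noised copy of their ground-truth values, so only the unknown part is denoised freely. The function and tensor names are hypothetical and follow the standard DDPM forward process; they do not correspond to a specific implementation in either baseline.

```python
import torch

def inpaint_step(x_t, known_values, known_mask, alpha_bar_t):
    """Overwrite known entries of x_t with q(x_t | x_0) samples of their ground-truth values.

    x_t:          current noisy scene attributes
    known_values: ground-truth attributes of the fixed (known) objects
    known_mask:   boolean mask selecting the known entries
    alpha_bar_t:  cumulative noise-schedule coefficient at step t
    """
    noised_known = ((alpha_bar_t ** 0.5) * known_values
                    + ((1.0 - alpha_bar_t) ** 0.5) * torch.randn_like(known_values))
    return torch.where(known_mask, noised_known, x_t)
```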
Related Links
BibTeX
If you find our work helpful, please consider citing:
@inproceedings{lin2024instructscene,
title={InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior},
author={Lin, Chenguo and Mu, Yadong},
booktitle={International Conference on Learning Representations (ICLR)},
year={2024}
}