InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior

International Conference on Learning Representations (ICLR) 2024
🌟 Spotlight 🌟

Peking University

InstructScene is a generative framework to synthesize 3D indoor scenes from instructions. It is composed of a semantic graph prior and a layout decoder.

Abstract

Comprehending natural language instructions is a desirable property for 3D indoor scene synthesis systems. Existing methods (e.g., ATISS and DiffuScene) directly model object distributions within a scene, which hinders the controllability of generation.

We introduce InstructScene, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 3D scene synthesis. The proposed semantic graph prior jointly learns indoor scene appearance and layout distributions, exhibiting versatility across various generative tasks. To facilitate benchmarking of text-driven 3D scene synthesis, we curate a high-quality dataset of scene-instruction pairs with large language and multimodal models.

Extensive experimental results reveal that the proposed method surpasses existing state-of-the-art approaches by a large margin. Thorough ablation studies confirm the efficacy of crucial design components. Our code and dataset are available here.

Method

Scene-Instruction Pair Dataset

We construct a high-quality dataset of scene-instruction pairs based on 3D-FRONT, a professionally designed collection of synthetic indoor scenes. Since it does not contain any descriptions of room layouts or object appearances, we (1) extract view-dependent spatial relations with predefined rules, and (2) caption the objects appearing in the scenes with BLIP. To ensure the accuracy of the descriptions, (3) the generated captions are refined by ChatGPT with ground-truth object categories. (4) The final instructions are derived from randomly selected relation triplets. For more details on the dataset, please refer to the appendix of our paper. The curated dataset is available here.
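As a rough illustration of steps (1) and (4), the sketch below assigns a view-dependent spatial relation from two object positions and turns a relation triplet into an instruction sentence. All function names and thresholds are hypothetical placeholders that mirror the rule-based procedure described above; they are not the released dataset scripts.

import numpy as np

def spatial_relation(anchor_pos, target_pos, closeness=1.0):
    """Assign a view-dependent relation of the target w.r.t. the anchor.

    Positions are 3D object centers in a view-aligned frame
    (x: right, y: up, z: forward); thresholds are illustrative only.
    """
    dx, dy, dz = np.asarray(target_pos) - np.asarray(anchor_pos)
    if dy > 0.5:                      # clearly higher than the anchor
        return "above"
    if dy < -0.5:
        return "below"
    horizontal = "left of" if dx < 0 else "right of"
    if np.hypot(dx, dz) < closeness:  # within a "close" radius
        return "closely " + horizontal
    return horizontal

def triplet_to_instruction(subject_caption, relation, object_caption):
    """Turn a (subject, relation, object) triplet into one instruction."""
    return f"Place a {subject_caption} {relation} a {object_caption}"

# Example: a lamp placed above a table
print(triplet_to_instruction(
    "black pendant lamp with hanging balls",
    spatial_relation(anchor_pos=[0.0, 0.8, 2.0], target_pos=[0.1, 2.3, 2.1]),
    "grey dining table with round top",
))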

Semantic Graph Prior

  • Feature Quantization: semantic features of 3D objects are extracted with a frozen, multimodal-aligned point cloud encoder (OpenShape) and then quantized by codebook entries.
  • Discrete Semantic Graph Diffusion: three categorical variables (object categories, spatial relations, and quantized features) are independently masked; empty states are omitted for concision. A graph Transformer with a frozen text encoder learns the semantic graph prior by iteratively recovering the corrupted graphs. A minimal sketch of both steps is given below.
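The sketch below illustrates the two steps above, assuming a pre-extracted OpenShape feature per object and a learned codebook: each feature is quantized to the index of its nearest codebook entry, and the three categorical variables are then independently corrupted by replacing a random subset of entries with a [MASK] state. Tensor shapes, the mask index, and the linear masking schedule are illustrative assumptions, not the released implementation.

import torch

def quantize(features, codebook):
    """Map continuous object features to nearest-codebook indices.

    features: (num_objects, dim), codebook: (num_codes, dim)
    """
    dists = torch.cdist(features, codebook)   # (num_objects, num_codes)
    return dists.argmin(dim=-1)               # discrete feature tokens

def mask_tokens(tokens, mask_id, t, num_steps):
    """Independently replace tokens with [MASK] at corruption level t / num_steps."""
    keep_prob = 1.0 - t / num_steps
    corrupted = tokens.clone()
    corrupted[torch.rand(tokens.shape) > keep_prob] = mask_id
    return corrupted

# Toy example: 5 objects, 64-dim features, a 128-entry codebook
feats = torch.randn(5, 64)
codebook = torch.randn(128, 64)
quantized = quantize(feats, codebook)          # quantized feature tokens
categories = torch.randint(0, 20, (5,))        # object category tokens
relations = torch.randint(0, 10, (5, 5))       # pairwise relation tokens

t, num_steps = 3, 10
corrupted = [mask_tokens(x, mask_id=999, t=t, num_steps=num_steps)
             for x in (categories, relations, quantized)]
# A graph Transformer, conditioned on the frozen text encoder, is trained to
# recover the original tokens from the corrupted graphs.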

Layout Decoder

Gaussian noise is sampled and attached to every node of the semantic graph. A graph Transformer iteratively processes the graph to remove the noise and generate layout configurations, including the positions (t), sizes (s), and orientations (r) of objects.
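This sampling loop can be sketched roughly as follows, assuming a standard DDPM-style schedule. The denoiser argument stands in for the graph-Transformer layout decoder, and the 8-dim attribute split (3D position, 3D size, and yaw orientation as cosine/sine) is an illustrative assumption rather than the exact parameterization.

import torch

@torch.no_grad()
def sample_layout(denoiser, graph_tokens, num_objects, num_steps=100):
    """Sample (t, s, r) layout attributes conditioned on a semantic graph."""
    betas = torch.linspace(1e-4, 0.02, num_steps)     # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(num_objects, 8)                   # Gaussian noise on every node
    for step in reversed(range(num_steps)):
        t = torch.full((num_objects,), step, dtype=torch.long)
        eps = denoiser(x, t, graph_tokens)            # predicted noise per node
        # Standard DDPM posterior mean given the predicted noise
        coef = betas[step] / torch.sqrt(1.0 - alpha_bars[step])
        x = (x - coef * eps) / torch.sqrt(alphas[step])
        if step > 0:                                  # no extra noise at the final step
            x = x + torch.sqrt(betas[step]) * torch.randn_like(x)
    return x                                          # positions, sizes, orientations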

Qualitative Results

We provide visualizations of our model and two baselines, ATISS and DiffuScene. All synthesized scenes are rendered with Blender. The rendering script is available here.

Instruction-Driven Synthesis

"Add a corner side table with a round top to the left of a black and silver pendant lamp with lights"
ATISS DiffuScene InstructScene (Ours)
"Place a black pendant lamp with hanging balls above a grey dining table with round top. Next, position a grey dining chair to the close right below of a black pendant lamp with hanging balls"
ATISS DiffuScene InstructScene (Ours)
"Set up a brass pendant lamp with lights above a dining table with a marble top"
ATISS DiffuScene InstructScene (Ours)

Zero-shot Applications

Thanks to the discrete design and mask modeling, the learned semantic graph prior can handle diverse downstream tasks without any fine-tuning. We investigate four zero-shot tasks: (1) stylization, (2) re-arrangement, (3) completion, and (4) unconditional generation. The first three tasks can be regarded as conditional synthesis guided by both instructions and partial scene attributes.
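Under this mask-modeling view, the three conditional tasks can reuse the prior without fine-tuning roughly as sketched below: attributes to be preserved are kept as observed tokens, while the rest start in the [MASK] state and are the only entries the graph Transformer resamples. The function name, mask index, and the per-task masking choices are hypothetical.

import torch

def prepare_partial_scene(tokens, keep_mask, mask_id):
    """Keep observed attributes and put the rest into the [MASK] state.

    tokens:    discrete scene attributes of an existing scene
               (categories, relations, or quantized features).
    keep_mask: boolean tensor, True where the attribute is preserved, e.g.
               keep categories/relations but mask features for stylization,
               or keep object tokens but mask relations for re-arrangement.
    """
    conditioned = tokens.clone()
    conditioned[~keep_mask] = mask_id       # unknown entries start as [MASK]
    return conditioned

# Re-arrangement example: mask every pairwise relation of a messy scene
relations = torch.randint(0, 10, (5, 5))
keep = torch.zeros_like(relations, dtype=torch.bool)
masked_relations = prepare_partial_scene(relations, keep, mask_id=999)
# The graph Transformer then iteratively fills in only the masked entries,
# guided by the instruction, while the observed tokens stay fixed.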

Stylization

"Make the room brown style"
Original Scene ATISS DiffuScene InstructScene (Ours)
"Make objects in the room black"
Original Scene ATISS DiffuScene InstructScene (Ours)
"Let the room be in gray style"
Original Scene ATISS DiffuScene InstructScene (Ours)

Re-arrangement

From left to right: (1) input instructions, (2) messy scenes, (3) ATISS, (4) DiffuScene, (5) InstructScene (Ours).

Completion

From left to right: (1) input instructions, (2) original scenes, (3) ATISS, (4) DiffuScene, (5) InstructScene (Ours).

Unconditional Generation

From left to right: (1) ATISS, (2) DiffuScene, (3) InstructScene (Ours).

InstructScene without Semantic Features

  • Left three columns: (1) input instructions, (2) InstructScene without semantic features, (3) InstructScene (Ours).
  • Right three columns: unconditional generation without semantic features.

A significant decline in appearance controllability and style consistency can be observed when semantic features are omitted. This is because, without semantic features, the generative model only captures the distributions of layout attributes; as a result, the occurrences and combinations of generated objects are unaware of object style and appearance, which are crucial elements in scene design.

Diversity

  • Left three columns: a diverse set of scenes generated from the same instructions.
  • Right three columns: a diverse set of scenes generated from the same semantic graphs.

Quantitative Results

Instruction-Driven Synthesis

ATISS outperforms DiffuScene in terms of generation fidelity, owing to its capacity to model in discrete spaces. DiffuScene shows better controllability than ATISS because it affords global visibility of samples during generation. The proposed InstructScene exhibits the best of both worlds.

It is noteworthy that InstructScene excels in handling more complex scenes, such as living and dining rooms, revealing the benefits of modeling intricate 3D scenes associated with the semantic graph prior.

Zero-shot Applications

While ATISS, as an auto-regressive model, is a natural fit for the completion task, its unidirectional dependency chain limits its effectiveness for tasks requiring global scene modeling, such as re-arrangement. DiffuScene can adapt to these tasks by replacing the known parts with their noised counterparts during sampling, similar to image in-painting. However, the known attributes are heavily corrupted in the early steps, which can misguide the denoising direction and therefore necessitates fine-tuning. It also faces challenges in searching for semantic features in a continuous space for stylization. In contrast, the proposed InstructScene globally models scene attributes and treats partial scene attributes as intermediate discrete states during training.
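For reference, the in-painting-style adaptation of a continuous diffusion model mentioned above can be sketched as a single replacement applied at every sampling step: the known scene attributes are re-noised to the current noise level and overwrite the corresponding entries of the sample. This is a generic sketch of the technique under a standard DDPM parameterization, not DiffuScene's code.

import torch

def inpaint_replace(x_t, x_known, known_mask, alpha_bar_t):
    """Overwrite known attributes with their re-noised values at step t.

    x_t:         current diffusion sample of the scene attributes.
    x_known:     clean attributes of the part of the scene to keep.
    known_mask:  boolean tensor marking which entries are known.
    alpha_bar_t: cumulative noise-schedule coefficient at step t.
    """
    alpha_bar_t = torch.as_tensor(alpha_bar_t)
    noised_known = (torch.sqrt(alpha_bar_t) * x_known
                    + torch.sqrt(1.0 - alpha_bar_t) * torch.randn_like(x_known))
    return torch.where(known_mask, noised_known, x_t)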

BibTeX

If you find our work helpful, please consider citing:

@inproceedings{lin2024instructscene,
  title={InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior},
  author={Chenguo Lin and Yadong Mu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}