InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior

International Conference on Learning Representations (ICLR) 2024
🌟 Spotlight 🌟

Peking University

InstructScene is a generative framework to synthesize 3D indoor scenes from instructions. It is composed of a semantic graph prior and a layout decoder.


The ability to comprehend natural language instructions is a desirable property for 3D indoor scene synthesis systems. Existing methods (e.g., ATISS and DiffuScene) directly model the object distributions within a scene, which hinders the controllability of generation.

We introduce InstructScene, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 3D scene synthesis. The proposed semantic graph prior jointly learns indoor scene appearance and layout distributions, exhibiting versatility across various generative tasks. To facilitate benchmarking of text-driven 3D scene synthesis, we curate a high-quality dataset of scene-instruction pairs with large language and multimodal models.

Extensive experimental results reveal that the proposed method surpasses existing state-of-the-art approaches by a large margin. Thorough ablation studies confirm the efficacy of crucial design components. Our code and dataset are available here.


Scene-Instruction Pair Dataset

We construct a high-quality dataset of scene-instruction pairs based on 3D-FRONT, a professionally designed collection of synthetic indoor scenes. As it does not contain any descriptions of room layouts or object appearances, we (1) extract view-dependent spatial relations with predefined rules, and (2) caption objects appearing in the scenes with BLIP. To ensure the accuracy of descriptions, (3) the generated captions are refined by ChatGPT with ground-truth object categories. (4) The final instructions are derived from randomly selected relation triplets. For more details on the dataset, please refer to the appendix of our paper. The curated dataset is available here.
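Step (1) above can be sketched as a small rule-based classifier. The thresholds, axis convention, and relation names below are hypothetical placeholders, not the paper's actual rule set:

```python
import numpy as np

def spatial_relation(anchor_pos, target_pos, eps=0.05):
    """Classify a view-dependent relation of `target` w.r.t. `anchor`.

    A hypothetical rule set for illustration: compare centroid offsets
    along the camera-aligned x axis (left/right) and the vertical y
    axis (above/below). Positions are (x, y, z) centroids.
    """
    dx, dy, _ = np.asarray(target_pos) - np.asarray(anchor_pos)
    if dy > eps:
        return "above"
    if dy < -eps:
        return "below"
    return "left of" if dx < -eps else "right of"

# Example: a pendant lamp 1.5 m above a table centroid
rel = spatial_relation((0.0, 0.0, 0.0), (0.0, 1.5, 0.0))
```

Extracted (anchor, relation, target) triplets like this one are then sampled and verbalized into the final instructions.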

Semantic Graph Prior

  • Feature Quantization: semantic features for 3D objects are extracted with a frozen, multimodally aligned point cloud encoder (OpenShape) and then quantized into codebook entries.
  • Discrete Semantic Graph Diffusion: three categorical variables — object categories, spatial relations, and quantized features — are independently masked. Empty states are not depicted here for brevity. A graph Transformer with a frozen text encoder learns the semantic graph prior by iteratively recovering corrupted graphs.
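The forward corruption of the discrete diffusion can be sketched as absorbing-state masking: each categorical variable is independently replaced by a [MASK] token with a probability that grows with the diffusion step. The linear schedule and token conventions below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

MASK = -1  # absorbing [MASK] state (hypothetical index convention)

def mask_corrupt(tokens, t, T, rng):
    """Independently replace discrete tokens with [MASK].

    The masking probability grows with the diffusion step t; a linear
    schedule is assumed here purely for illustration.
    """
    tokens = np.asarray(tokens).copy()
    p = (t + 1) / T  # fraction of tokens masked at step t
    mask = rng.random(tokens.shape) < p
    tokens[mask] = MASK
    return tokens

rng = np.random.default_rng(0)
T = 100
categories = np.array([3, 7, 1, 4])     # object category ids
relations  = np.array([0, 2, 5])        # edge relation ids
features   = np.array([12, 40, 9, 33])  # quantized feature indices
# At the final step t = T - 1 every variable is fully masked:
fully_masked = mask_corrupt(categories, T - 1, T, rng)
```

The graph Transformer is trained to invert this corruption, recovering the original categories, relations, and quantized features from partially masked graphs.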

Layout Decoder

Gaussian noise is sampled and attached to every node of the semantic graph. A graph Transformer iteratively denoises these graphs to generate layout configurations, including object positions (t), sizes (s), and orientations (r).
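The sampling loop can be sketched as follows. The 7-dimensional layout vector, step count, and the toy denoiser are illustrative assumptions standing in for the actual graph Transformer and noise schedule:

```python
import numpy as np

def denoise_layout(graph_nodes, denoiser, steps=50, rng=None):
    """Sample layout attributes for each node of a semantic graph.

    Gaussian noise is attached to every node and iteratively refined.
    `denoiser` stands in for the graph Transformer; it maps the noisy
    per-node vectors to cleaner ones at each step.
    """
    rng = rng or np.random.default_rng()
    n = len(graph_nodes)
    # 3 position + 3 size + 1 orientation-angle dims (an assumed split)
    x = rng.standard_normal((n, 7))
    for step in reversed(range(steps)):
        x = denoiser(graph_nodes, x, step)
    t, s, r = x[:, :3], x[:, 3:6], x[:, 6:]
    return t, s, r

# Placeholder denoiser that merely shrinks the noise each step
toy = lambda nodes, x, step: 0.9 * x
t, s, r = denoise_layout(["table", "lamp"], toy, steps=10)
```

Because the decoder conditions on the semantic graph, the same graph can be decoded into many plausible layouts by re-sampling the initial noise.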

Qualitative Results

We provide visualizations for our model and two baselines, ATISS and DiffuScene. All the synthesized scenes are rendered with Blender. The rendering script is available here.

Instruction-Driven Synthesis

"Add a corner side table with a round top to the left of a black and silver pendant lamp with lights"
ATISS DiffuScene InstructScene (Ours)
"Place a black pendant lamp with hanging balls above a grey dining table with round top. Next, position a grey dining chair to the close right below of a black pendant lamp with hanging balls"
ATISS DiffuScene InstructScene (Ours)
"Set up a brass pendant lamp with lights above a dining table with a marble top"
ATISS DiffuScene InstructScene (Ours)

Zero-shot Applications

Thanks to the discrete design and mask modeling, the learned semantic graph prior can perform diverse downstream tasks without any fine-tuning. We investigate four zero-shot tasks: (1) stylization, (2) re-arrangement, (3) completion, and (4) unconditional generation. The first three tasks can be regarded as conditional synthesis guided by both instructions and partial scene attributes.
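Framing the first three tasks as conditional synthesis is straightforward in the discrete setting: known attributes keep their token values, while the attributes to be generated start as [MASK] and are filled in by the prior's iterative unmasking. The token conventions below are hypothetical:

```python
import numpy as np

MASK = -1  # absorbing [MASK] token (hypothetical index convention)

def init_zero_shot(known_tokens):
    """Initialize a zero-shot task as conditional synthesis.

    `known_tokens` uses None for attributes to be generated; every
    other slot keeps its ground-truth discrete value, so no
    fine-tuning is needed.
    """
    return np.array([MASK if v is None else v for v in known_tokens])

# Completion: categories of two existing objects are kept, two are free
init = init_zero_shot([3, None, 7, None])
```

Stylization fixes categories and relations while masking quantized features; re-arrangement does the reverse for spatial relations; unconditional generation masks everything.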


Stylization

"Make the room brown style"
Original Scene ATISS DiffuScene InstructScene (Ours)
"Make objects in the room black"
Original Scene ATISS DiffuScene InstructScene (Ours)
"Let the room be in gray style"
Original Scene ATISS DiffuScene InstructScene (Ours)


Re-arrangement

From left to right: (1) input instructions, (2) messy scenes, (3) ATISS, (4) DiffuScene, (5) InstructScene (Ours).


Completion

From left to right: (1) input instructions, (2) original scenes, (3) ATISS, (4) DiffuScene, (5) InstructScene (Ours).

Unconditional Generation

From left to right: (1) ATISS, (2) DiffuScene, (3) InstructScene (Ours).

InstructScene without Semantic Features

  • Left three columns: (1) input instructions, (2) InstructScene without semantic features, (3) InstructScene (Ours).
  • Right three columns: unconditional generation without semantic features.

A significant decline in appearance controllability and style consistency can be observed when semantic features are omitted. This arises because, without semantic features, the generative model focuses solely on modeling the distributions of layout attributes. As a result, the occurrences and combinations of generated objects lack awareness of object style and appearance, which are crucial elements in scene design.


  • Left three columns: a diverse set of scenes generated from the same instructions.
  • Right three columns: a diverse set of scenes generated from the same semantic graphs.

Quantitative Results

Instruction-Driven Synthesis

ATISS outperforms DiffuScene in terms of generation fidelity, owing to its capacity to model in discrete spaces. DiffuScene shows better controllability than ATISS because it affords global visibility of samples during generation. The proposed InstructScene exhibits the best of both worlds.

It is noteworthy that InstructScene excels in handling more complex scenes, such as living and dining rooms, revealing the benefits of modeling intricate 3D scenes associated with the semantic graph prior.

Zero-shot Applications

While ATISS, as an auto-regressive model, is a natural fit for the completion task, its unidirectional dependency chain limits its effectiveness for tasks requiring global scene modeling, such as re-arrangement. DiffuScene can adapt to these tasks by replacing the known parts with noised copies of the corresponding scene attributes during sampling, similar to image in-painting. However, the known attributes are heavily corrupted in the early steps, which can misguide the denoising direction and therefore necessitates fine-tuning. It also faces challenges in searching a continuous space of semantic features for stylization. In contrast, the proposed InstructScene globally models scene attributes and treats partial scene attributes as intermediate discrete states during training.
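The "replace-known" adaptation for a continuous diffusion sampler, as used by DiffuScene above, can be sketched as follows. The noise schedule, attribute dimensions, and function name are illustrative assumptions:

```python
import numpy as np

def inpaint_step(x_t, known, known_mask, t, alpha_bar, rng):
    """One 'replace-known' step for a continuous diffusion sampler.

    After each denoising step, the known attributes are overwritten
    with a freshly noised copy of their ground-truth values at noise
    level t (the cumulative schedule `alpha_bar` is assumed given).
    """
    noised_known = (np.sqrt(alpha_bar[t]) * known
                    + np.sqrt(1.0 - alpha_bar[t])
                    * rng.standard_normal(known.shape))
    return np.where(known_mask, noised_known, x_t)

rng = np.random.default_rng(0)
alpha_bar = np.linspace(1.0, 1e-4, 100)  # toy schedule
x_t = rng.standard_normal((4, 7))        # 4 objects, 7 layout dims
known = np.zeros((4, 7))                 # ground-truth attributes
mask = np.zeros((4, 7), dtype=bool)
mask[:2] = True                          # first two objects are known
x_t = inpaint_step(x_t, known, mask, t=0, alpha_bar=alpha_bar, rng=rng)
```

Early in sampling (large t), `noised_known` is dominated by noise, which is exactly the corruption of known attributes that can misguide the denoising direction; InstructScene's discrete masking sidesteps this because known tokens are kept exactly.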


If you find our work helpful, please consider citing:

@inproceedings{lin2024instructscene,
  title={InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior},
  author={Chenguo Lin and Yadong Mu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}