We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
Figure 2: Overview of the inference pipeline of LayouSyn.
We frame the scene layout generation task as a two-stage process: (1) a lightweight open-source language model extracts the scene elements (object descriptions) from the text prompt, and (2) an aspect-aware diffusion Transformer, trained in an open-vocabulary manner, generates a layout of bounding boxes for those elements conditioned on the prompt.
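The sketch below illustrates this two-stage pipeline. It is a minimal illustration only: the `llm` callable, the `model` interface, and every function name here are assumptions made for exposition, not the released API.

```python
# Minimal sketch of the two-stage LayouSyn inference pipeline (illustrative only).
# The `llm` callable and `model` interface below are assumed, not the released API.
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x: float  # normalized coordinates in [0, 1]
    y: float
    w: float
    h: float


def extract_scene_elements(prompt: str, llm) -> list[str]:
    """Stage 1: a lightweight open-source LLM lists the objects implied by the prompt."""
    instruction = (
        "List the objects that should appear in an image described by: "
        f"'{prompt}'. Return one object phrase per line."
    )
    return [line.strip() for line in llm(instruction).splitlines() if line.strip()]


def generate_layout(prompt: str, elements: list[str], model,
                    aspect_ratio: float = 1.0, steps: int = 100) -> list[Box]:
    """Stage 2: an aspect-aware diffusion Transformer denoises random boxes into a
    layout conditioned on the prompt and the extracted scene elements."""
    boxes = model.sample_noise(num_boxes=len(elements), aspect_ratio=aspect_ratio)
    for t in model.timesteps(steps):
        boxes = model.denoise_step(boxes, t, prompt=prompt,
                                   elements=elements, aspect_ratio=aspect_ratio)
    return [Box(label, *xywh) for label, xywh in zip(elements, boxes)]


# Usage (with hypothetical `llm` and `model` objects):
# elements = extract_scene_elements("A man riding a horse on the street", llm)
# layout = generate_layout("A man riding a horse on the street", elements, model,
#                          aspect_ratio=16 / 9)
```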
Figure 3: Comparative analysis with LayoutGPT. In the first example, LayoutGPT produces a semantically incorrect layout, with the table and chairs not positioned under the lamp, while our method follows the constraints precisely. In the second example, LayoutGPT generates a geometrically incorrect layout for the cat, whereas our method successfully understands the relationships between the objects and produces a correct layout.
Figure 4: Diversity of layouts generated by LayouSyn for the same text prompt.
Figure 5: Layout generation with varying aspect ratios. Layouts generated at different aspect ratios for the prompt "A man riding a horse on the street." The model adjusts the position and aspect ratio of the man and the horse to produce natural-looking layouts.
We evaluate LayouSyn on two criteria: (1) layout quality, measured by the layout FID (L-FID) on the COCO-GR dataset (Table 1), and (2) spatial and numerical reasoning, evaluated on the NSR-1K benchmark (Table 2).
Model | L-FID ↓ |
---|---|
LayoutGPT (GPT-3.5) | 3.51 |
LayoutGPT (GPT-4o-mini) | 6.72 |
Llama-3.1-8B (finetuned) | 13.95 |
LayouSyn | 3.07 (+12.5%) |
LayouSyn (GRIT pretraining) | 3.31 (+5.6%) |
Table 1: Layout quality evaluation on the COCO-GR dataset. Our method outperforms existing layout generation methods on the layout FID (L-FID) score by at least 5.6%.
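The exact L-FID protocol is not detailed here. Purely as an assumption, the sketch below shows one common way to compute an FID-style score over layouts: render each set of bounding boxes to a simple image and compare the feature statistics of generated versus reference renderings with `torchmetrics`; the paper's metric may differ.

```python
# Hypothetical layout-FID sketch: render boxes to images and compute a standard FID.
# This is an assumed protocol for illustration, not necessarily the paper's L-FID.
import torch
from PIL import Image, ImageDraw
from torchvision.transforms.functional import pil_to_tensor
from torchmetrics.image.fid import FrechetInceptionDistance


def render_layout(boxes, size=299):
    """Draw each (x, y, w, h) box, with coordinates normalized to [0, 1], on a white canvas."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for x, y, w, h in boxes:
        draw.rectangle([x * size, y * size, (x + w) * size, (y + h) * size],
                       outline="black", fill="gray")
    return pil_to_tensor(img)  # uint8 tensor of shape (3, size, size)


def layout_fid(reference_layouts, generated_layouts):
    """Frechet distance between Inception features of reference and generated renderings."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(torch.stack([render_layout(b) for b in reference_layouts]), real=True)
    fid.update(torch.stack([render_layout(b) for b in generated_layouts]), real=False)
    return fid.compute().item()
```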
Method | Numerical Prec. ↑ | Numerical Recall ↑ | Numerical Acc. ↑ | Numerical GLIP ↑ | Spatial Acc. ↑ | Spatial GLIP ↑ |
---|---|---|---|---|---|---|
GT layouts | 100.0 | 100.0 | 100.0 | 50.08 | 100.00 | 57.20 |
In-context Learning | | | | | | |
LayoutGPT (Llama-3.1-8B) | 78.61 | 84.01 | 71.71 | 49.48 | 75.40 | 47.92 |
LayoutGPT (GPT-3.5) | 76.29 | 86.64 | 76.72 | 54.25 | 87.07 | 56.89 |
LayoutGPT (GPT-4o-mini) | 73.82 | 86.84 | 77.51 | 57.96 | 92.01 | 60.49 |
Zero-shot | | | | | | |
LLMGroundedDiffusion (GPT-4o-mini) | 84.36 | 95.94 | 89.94 | 38.56 | 72.46 | 27.09 |
LLM Blueprint (GPT-4o-mini) | 87.21 | 67.29 | 38.36 | 42.24 | 73.52 | 50.21 |
Trained / Finetuned | | | | | | |
LayoutTransformer* | 75.70 | 61.69 | 22.26 | 40.55 | 6.36 | 28.13 |
Ranni | 56.23 | 83.28 | 40.80 | 38.19 | 53.29 | 24.38 |
Llama-3.1-8B (finetuned) | 79.33 | 93.36 | 70.84 | 44.72 | 86.64 | 52.93 |
Ours | | | | | | |
LayouSyn | 77.62 | 99.23 | 95.14 | 56.17 | 87.49 | 54.91 |
LayouSyn (GRIT pretraining) | 77.62 | 99.23 | 95.14 | 56.20 | 92.58 | 58.94 |
Table 2: Spatial and numerical reasoning evaluation on the NSR-1K benchmark. LayouSyn outperforms existing methods on spatial and counting reasoning tasks, achieving state-of-the-art performance on most metrics. Note: * indicates metrics reported by LayoutGPT. We bold values for metrics where the ground truth (GT) is 100% and underline values where a method exceeds the ground-truth performance.
Method | Llama-3.1-8B | GPT-3.5 | GPT-4o-mini |
---|---|---|---|
Original | 75.40 | 87.07 | 92.01 |
Description Set | 89.75 | 90.04 | 90.95 |
Description Set + Inv (15) | 90.46 | 92.37 | 92.08 |
Table 3: Spatial reasoning results with LLM initialization. We take the outputs of LayoutGPT with different LLMs (Original) and evaluate two strategies: (1) Description set only: use only the description sets predicted by LayoutGPT and perform denoising from Gaussian noise with the full 100 denoising steps; (2) Description set + inversion: in addition to using the description sets, apply DDIM inversion to the bounding boxes predicted by the LLM and denoise for the same number of steps as the inversion.
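As a rough illustration of these two strategies, the sketch below assumes a hypothetical `model` interface exposing noise sampling, DDIM inversion, and denoising; reading "Inv (15)" in Table 3 as 15 inversion steps is also an assumption.

```python
# Sketch of the two LLM-initialization strategies from Table 3 (illustrative only;
# the `model` methods below are assumed interfaces, not the released API).

def from_description_set(prompt, elements, model, steps=100):
    """Strategy 1: keep only the LLM-predicted description set and denoise
    from pure Gaussian noise with the full number of steps."""
    boxes = model.sample_noise(num_boxes=len(elements))
    return model.denoise(boxes, prompt=prompt, elements=elements, num_steps=steps)


def from_description_set_with_inversion(prompt, elements, llm_boxes, model, inv_steps=15):
    """Strategy 2: additionally DDIM-invert the LLM-predicted boxes, then denoise
    for the same number of steps as the inversion."""
    noisy_boxes = model.ddim_invert(llm_boxes, prompt=prompt, elements=elements,
                                    num_steps=inv_steps)
    return model.denoise(noisy_boxes, prompt=prompt, elements=elements,
                         num_steps=inv_steps)
```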
Figure 6: Examples of automated object addition using LayouSyn.
@article{srivastava2025layyourscenenaturalscenelayout,
  title={Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers},
  author={Divyansh Srivastava and Xiang Zhang and He Wen and Chenru Wen and Zhuowen Tu},
  year={2025},
  eprint={2505.04718},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.04718},
}