We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
Figure 2: Overview of the inference pipeline of LayouSyn.
We frame the scene layout generation task as a two-stage process: (1) a lightweight open-source language model extracts the scene elements (object descriptions) from the text prompt, and (2) an aspect-aware diffusion Transformer, trained in an open-vocabulary manner, generates a layout of bounding boxes for those elements conditioned on the prompt.
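The sketch below illustrates this two-stage pipeline. It is a minimal illustration only: the `llm` callable, the `model` interface, and every function name here are assumptions made for exposition, not the released API.

```python
# Minimal sketch of the two-stage LayouSyn inference pipeline (illustrative only).
# The `llm` callable and `model` interface below are assumed, not the released API.
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x: float  # normalized coordinates in [0, 1]
    y: float
    w: float
    h: float


def extract_scene_elements(prompt: str, llm) -> list[str]:
    """Stage 1: a lightweight open-source LLM lists the objects implied by the prompt."""
    instruction = (
        "List the objects that should appear in an image described by: "
        f"'{prompt}'. Return one object phrase per line."
    )
    return [line.strip() for line in llm(instruction).splitlines() if line.strip()]


def generate_layout(prompt: str, elements: list[str], model,
                    aspect_ratio: float = 1.0, steps: int = 100) -> list[Box]:
    """Stage 2: an aspect-aware diffusion Transformer denoises random boxes into a
    layout conditioned on the prompt and the extracted scene elements."""
    boxes = model.sample_noise(num_boxes=len(elements), aspect_ratio=aspect_ratio)
    for t in model.timesteps(steps):
        boxes = model.denoise_step(boxes, t, prompt=prompt,
                                   elements=elements, aspect_ratio=aspect_ratio)
    return [Box(label, *xywh) for label, xywh in zip(elements, boxes)]


# Usage (with hypothetical `llm` and `model` objects):
# elements = extract_scene_elements("A man riding a horse on the street", llm)
# layout = generate_layout("A man riding a horse on the street", elements, model,
#                          aspect_ratio=16 / 9)
```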
Figure 3: Comparative analysis with LayoutGPT. In the first example, LayoutGPT produces a semantically incorrect layout, with the table and chairs not positioned under the lamp, while our method follows the constraints precisely. In the second example, LayoutGPT generates a geometrically incorrect layout for the cat, whereas our method successfully understands the relationships between the objects and produces a correct layout.
Figure 4: Diversity of layouts generated by LayouSyn for the same text prompt.
Figure 5: Layout generation with varying aspect ratios. Layouts generated at different aspect ratios for the prompt "A man riding a horse on the street." The model adjusts the position and aspect ratio of the man and the horse to produce natural-looking layouts.
We evaluate LayouSyn on two criteria: (1) layout quality, measured by the layout FID (L-FID) on the COCO-GR dataset (Table 1), and (2) spatial and numerical reasoning, evaluated on the NSR-1K benchmark (Table 2).
Model | L-FID ↓ |
---|---|
LayoutGPT (GPT-3.5) | 3.51 |
LayoutGPT (GPT-4o-mini) | 6.72 |
Llama-3.1-8B (finetuned) | 13.95 |
LayouSyn | 3.07 (+12.5%) |
LayouSyn (GRIT pretraining) | 3.31 (+5.6%) |
Table 1: Layout quality evaluation on the COCO-GR dataset. Our method outperforms existing layout generation methods on the layout FID (L-FID) score by at least 5.6%.
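The exact L-FID protocol is not detailed here. Purely as an assumption, the sketch below shows one common way to compute an FID-style score over layouts: render each set of bounding boxes to a simple image and compare the feature statistics of generated versus reference renderings with `torchmetrics`; the paper's metric may differ.

```python
# Hypothetical layout-FID sketch: render boxes to images and compute a standard FID.
# This is an assumed protocol for illustration, not necessarily the paper's L-FID.
import torch
from PIL import Image, ImageDraw
from torchvision.transforms.functional import pil_to_tensor
from torchmetrics.image.fid import FrechetInceptionDistance


def render_layout(boxes, size=299):
    """Draw each (x, y, w, h) box, with coordinates normalized to [0, 1], on a white canvas."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for x, y, w, h in boxes:
        draw.rectangle([x * size, y * size, (x + w) * size, (y + h) * size],
                       outline="black", fill="gray")
    return pil_to_tensor(img)  # uint8 tensor of shape (3, size, size)


def layout_fid(reference_layouts, generated_layouts):
    """Frechet distance between Inception features of reference and generated renderings."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(torch.stack([render_layout(b) for b in reference_layouts]), real=True)
    fid.update(torch.stack([render_layout(b) for b in generated_layouts]), real=False)
    return fid.compute().item()
```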
Method | Numerical Prec. ↑ | Numerical Recall ↑ | Numerical Acc. ↑ | Numerical GLIP ↑ | Spatial Acc. ↑ | Spatial GLIP ↑ |
---|---|---|---|---|---|---|
GT layouts | 100.0 | 100.0 | 100.0 | 50.08 | 100.00 | 57.20 |
In-context Learning | | | | | | |
LayoutGPT (Llama-3.1-8B) | 78.61 | 84.01 | 71.71 | 49.48 | 75.40 | 47.92 |
LayoutGPT (GPT-3.5) | 76.29 | 86.64 | 76.72 | 54.25 | 87.07 | 56.89 |
LayoutGPT (GPT-4o-mini) | 73.82 | 86.84 | 77.51 | 57.96 | 92.01 | 60.49 |
Zero-shot | | | | | | |
LLMGroundedDiffusion (GPT-4o-mini) | 84.36 | 95.94 | 89.94 | 38.56 | 72.46 | 27.09 |
LLM Blueprint (GPT-4o-mini) | 87.21 | 67.29 | 38.36 | 42.24 | 73.52 | 50.21 |
Trained / Finetuned | | | | | | |
LayoutTransformer* | 75.70 | 61.69 | 22.26 | 40.55 | 6.36 | 28.13 |
Ranni | 56.23 | 83.28 | 40.80 | 38.19 | 53.29 | 24.38 |
Llama-3.1-8B (finetuned) | 79.33 | 93.36 | 70.84 | 44.72 | 86.64 | 52.93 |
Ours | | | | | | |
LayouSyn | 77.62 | 99.23 | 95.14 | 56.17 | 87.49 | 54.91 |
LayouSyn (GRIT pretraining) | 77.62 | 99.23 | 95.14 | 56.20 | 92.58 | 58.94 |
Table 2: Spatial and numerical reasoning evaluation on the NSR-1K benchmark. LayouSyn outperforms existing methods on spatial and counting reasoning tasks, achieving state-of-the-art performance on most metrics. Note: * indicates metrics reported by LayoutGPT. We bold values for metrics where the ground truth (GT) is 100% and underline values where a method exceeds the ground-truth performance.
Method | Llama-3.1-8B | GPT-3.5 | GPT-4o-mini |
---|---|---|---|
Original | 75.40 | 87.07 | 92.01 |
Description Set | 89.75 | 90.04 | 90.95 |
Description Set + Inv (15) | 90.46 | 92.37 | 92.08 |
Table 3: Spatial reasoning results with LLM initialization. We take the outputs of LayoutGPT with different LLMs (Original) and evaluate two strategies: (1) Description set only: use only the description sets predicted by LayoutGPT and perform denoising from Gaussian noise with the full 100 denoising steps; (2) Description set + inversion: in addition to using the description sets, apply DDIM inversion to the bounding boxes predicted by the LLM and denoise for the same number of steps as the inversion.
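As a rough illustration of these two strategies, the sketch below assumes a hypothetical `model` interface exposing noise sampling, DDIM inversion, and denoising; reading "Inv (15)" in Table 3 as 15 inversion steps is also an assumption.

```python
# Sketch of the two LLM-initialization strategies from Table 3 (illustrative only;
# the `model` methods below are assumed interfaces, not the released API).

def from_description_set(prompt, elements, model, steps=100):
    """Strategy 1: keep only the LLM-predicted description set and denoise
    from pure Gaussian noise with the full number of steps."""
    boxes = model.sample_noise(num_boxes=len(elements))
    return model.denoise(boxes, prompt=prompt, elements=elements, num_steps=steps)


def from_description_set_with_inversion(prompt, elements, llm_boxes, model, inv_steps=15):
    """Strategy 2: additionally DDIM-invert the LLM-predicted boxes, then denoise
    for the same number of steps as the inversion."""
    noisy_boxes = model.ddim_invert(llm_boxes, prompt=prompt, elements=elements,
                                    num_steps=inv_steps)
    return model.denoise(noisy_boxes, prompt=prompt, elements=elements,
                         num_steps=inv_steps)
```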
Figure 6: Examples of automated object addition using LayouSyn.
@article{srivastava2025layyourscenenaturalscenelayout,
  title={Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers},
  author={Divyansh Srivastava and Xiang Zhang and He Wen and Chenru Wen and Zhuowen Tu},
  year={2025},
  eprint={2505.04718},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.04718},
}