OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

* equal contribution
1University of California, San Diego 2Lambda, Inc.
teaser
Examples from OverLayBench with difficulty increasing from left to right.

Abstract

Despite steady progress in layout-to-image generation, current methods still struggle with layouts containing significant overlap between bounding boxes. We identify two primary challenges: (1) large overlapping regions and (2) overlapping instances with minimal semantic distinction. Through both qualitative examples and quantitative analysis, we demonstrate how these factors degrade generation quality. To systematically assess this issue, we introduce OverLayScore, a novel metric that quantifies the complexity of overlapping bounding boxes. Our analysis reveals that existing benchmarks are biased toward simpler cases with low OverLayScore values, limiting their effectiveness in evaluating models under more challenging conditions. To reduce this gap, we present OverLayBench, a new benchmark featuring balanced OverLayScore distributions and high-quality annotations. As an initial step toward improved performance on complex overlaps, we also propose CreatiLayout-AM, a model trained on a curated amodal mask dataset. Together, our contributions establish a foundation for more robust layout-to-image generation under realistic and challenging scenarios.

OverLayScore

We created OverLayScore to measure how overlapping objects make layout-to-image (L2I) generation harder. This metric looks at two key factors:
  • Spatial overlap — how much the bounding boxes of objects intersect, measured by Intersection over Union (IoU).
  • Semantic similarity — how similar the descriptions of the objects are in the CLIP feature space.
motivation
Image quality of CreatiLayout across varying levels of bounding box IoU and semantic similarity, from poorer to better.
When two objects overlap more, the quality of generated images usually drops. If they also have similar captions (like both describing “a dog”), the problem becomes even worse. OverLayScore combines these effects into a single value, making it easier to benchmark and compare models on challenging layouts. Formally, OverLayScore is defined as: $$ \text{OverLayScore} = \sum_{\substack{(i,j): \text{IoU}(B_i,B_j)>0}} \text{IoU}(B_i,B_j)\,\cdot\,\cos\bigl(\langle p_i, p_j \rangle \bigr) $$ Where \(B_i\) and \(B_j\) are the bounding boxes of objects, and \(\cos\bigl(\langle p_i, p_j \rangle \bigr)\) is the cosine similarity of their CLIP features.

OverLayBench

OverLayBench is a benchmark to test how well models generate images from layouts with overlapping objects. We create the dataset in three steps:
  1. Generate reference images from captions generated by VLM from real-images.
  2. Use VLM to detect objects, re-captions, and extract relationships.
  3. Manually curate annotations to ensure accuracy.
OverLayBench includes 4,052 examples with three overlap difficulty, measured by OverLayScore, covering a balanced range of overlap difficulty levels:
  • Simple: 2,052 cases
  • Regular: 1,000 cases
  • Complex: 1,000 cases
teaser
An overview of the OverLayBench pipeline.

Quantitative Results

Method # Params mIoU (%)\(\uparrow\) O-mIoU (%)\(\uparrow\) SRE (%)\(\uparrow\) SRR (%)\(\uparrow\) CLIPGlobal\(\uparrow\) CLIPLocal\(\uparrow\) FID\(\downarrow\)
UNet-based Models
GLIGEN 1.07B 60.54 36.22 49.99 78.72 34.17 24.75 31.27
InstanceDiff 1.23B 71.21 49.99 77.71 87.49 34.25 27.69 36.17
MIGC 0.86B 58.64 32.15 63.41 81.60 33.07 26.49 31.64
HiCo 1.22B 69.47 47.23 67.75 86.08 35.25 27.04 29.21
DiT-based Models
3DIS N/A 65.75 38.38 86.24 86.98 35.85 29.67 29.18
ELIGEN 12.2B 68.17 43.72 86.50 89.67 36.65 28.29 28.87
CreatiLayout-SD3 3.31B 58.78 32.52 72.34 84.45 37.29 27.49 27.51
CreatiLayout-FLUX 21.78B 71.17 49.80 84.35 90.87 37.40 28.07 23.79
Method # Params mIoU (%)\(\uparrow\) O-mIoU (%)\(\uparrow\) SRE (%)\(\uparrow\) SRR (%)\(\uparrow\) CLIPGlobal\(\uparrow\) CLIPLocal\(\uparrow\) FID\(\downarrow\)
UNet-based Models
GLIGEN 1.07B 52.46 34.15 44.88 77.46 33.93 23.42 52.22
InstanceDiff 1.23B 60.08 26.53 72.51 83.36 33.09 26.19 59.73
MIGC 0.86B 47.42 20.06 56.67 77.85 32.72 24.99 54.24
HiCo 1.22B 55.02 29.60 58.24 79.89 33.91 25.34 49.07
DiT-based Models
3DIS N/A 55.66 27.29 80.80 83.69 35.42 28.12 48.56
ELIGEN 12.2B 58.56 32.62 80.85 84.42 36.27 27.05 45.65
CreatiLayout-SD3 3.31B 47.04 20.67 62.60 78.31 36.67 25.55 45.57
CreatiLayout-FLUX 21.78B 59.72 35.51 77.20 86.39 36.73 26.21 41.51
Method # Params mIoU (%)\(\uparrow\) O-mIoU (%)\(\uparrow\) SRE (%)\(\uparrow\) SRR (%)\(\uparrow\) CLIPGlobal\(\uparrow\) CLIPLocal\(\uparrow\) FID\(\downarrow\)
UNet-based Models
GLIGEN 1.07B 47.20 25.63 41.70 79.93 33.92 22.75 57.32
InstanceDiff 1.23B 53.68 23.85 66.02 80.34 32.33 25.53 66.32
MIGC 0.86B 40.04 13.26 47.80 74.48 31.93 24.20 66.52
HiCo 1.22B 46.56 20.35 48.88 75.19 33.15 24.41 55.78
DiT-based Models
3DIS N/A 50.65 21.75 74.31 81.57 35.11 27.35 54.90
ELIGEN 12.2B 52.53 26.19 74.03 84.09 36.18 25.92 50.41
CreatiLayout-SD3 3.31B 44.24 18.05 52.10 79.98 36.55 24.76 53.29
CreatiLayout-FLUX 21.78B 54.50 28.97 69.72 86.45 36.72 24.85 45.66
Method mIoU (%)\(\uparrow\) O-mIoU (%)\(\uparrow\) SRE (%)\(\uparrow\) SRR (%)\(\uparrow\) CLIPGlobal\(\uparrow\) CLIPLocal\(\uparrow\) FID\(\downarrow\)
UNet-based Models
BoxDiff 24.48 7.71 42.03 69.94 36.78 21.33 34.65
LayoutGuidance 23.12 7.92 45.78 70.83 33.47 21.22 74.40
R&B 27.78 9.70 36.98 64.05 34.64 21.32 36.57
DiT-based Models
GroundDiT 31.92 10.93 48.57 75.26 36.10 22.73 34.98
RegionalPrompting 42.54 20.10 73.49 75.81 35.45 27.40 23.94
Method mIoU (%)\(\uparrow\) O-mIoU (%)\(\uparrow\) SRE (%)\(\uparrow\) SRR (%)\(\uparrow\) CLIPGlobal\(\uparrow\) CLIPLocal\(\uparrow\) FID\(\downarrow\)
UNet-based Models
BoxDiff 19.40 5.33 37.58 71.81 36.50 19.97 55.41
LayoutGuidance 15.81 4.51 44.94 72.76 31.76 20.05 128.16
R&B 20.35 5.54 32.85 65.01 34.49 19.88 58.55
DiT-based Models
GroundDiT 24.03 6.70 42.24 74.02 36.00 21.10 54.94
RegionalPrompting 32.72 12.29 63.74 67.08 34.46 25.82 43.13
Method mIoU (%)\(\uparrow\) O-mIoU (%)\(\uparrow\) SRE (%)\(\uparrow\) SRR (%)\(\uparrow\) CLIPGlobal\(\uparrow\) CLIPLocal\(\uparrow\) FID\(\downarrow\)
UNet-based Models
BoxDiff 20.02 5.21 33.52 76.41 36.92 19.91 59.70
LayoutGuidance 16.34 4.01 37.76 75.53 32.75 19.72 119.32
R&B 19.80 4.85 28.38 69.97 34.57 19.47 63.71
DiT-based Models
GroundDiT 24.77 6.58 37.85 77.03 36.20 20.55 55.59
RegionalPrompting 28.35 9.05 53.56 60.37 33.34 25.29 49.41

Qualitative Results

teaser
Comparison of generated images from different models on our OverLayBench.

CreatiLayout-AM

CreatiLayout-AM is an enhanced version of CreatiLayout designed to better handle complex overlapping objects. Unlike standard training, it incorporates amodal masks—complete object shapes even when parts are occluded—during training. By fine-tuning the model on a large set of synthetic occlusion examples and aligning its attention maps with ground truth amodal masks, CreatiLayout-AM improves the generation of distinct, coherent objects in overlapped regions. Experiments show consistent gains in overlap-sensitive metrics like O-mIoU, confirming that explicit occlusion supervision can significantly improve generation quality under challenging layouts.

Qualitative Comparison between basemodel

Method Split mIoU (%)\(\uparrow\) O-mIoU (%)\(\uparrow\) SRE (%)\(\uparrow\) SRR (%)\(\uparrow\) CLIPGlobal\(\uparrow\) CLIPLocal\(\uparrow\) FID\(\downarrow\)
CreatiLayout Simple 58.78 32.52 72.34 84.45 37.29 27.49 27.51
Ours Simple 61.16 37.69 73.33 84.84 37.17 27.44 27.76
vs. BaseModel +4.05% +15.88% +1.37% +0.46% -0.32% -0.18% +0.91%
CreatiLayout Regular 47.04 20.67 62.60 78.31 36.67 25.55 45.57
Ours Regular 47.38 21.79 63.13 78.71 36.49 25.46 46.34
vs. BaseModel +0.72% +5.42% +0.85% +0.51% -0.49% -0.35% +1.68%
CreatiLayout Complex 44.24 18.05 52.10 79.98 36.55 24.76 53.29
Ours Complex 43.97 18.07 52.49 79.77 36.32 24.72 53.48
vs. BaseModel -0.61% +0.11% +0.75% -0.26% -0.63% -0.16% +0.36%

Qualitative Comparison between basemodel

teaser
Comparison of generated image from CreatiLayout and CreatiLayout-AM. The CreatiLayout-AM generates overlap instances better, leading to more coherent and realistic images.

BibTeX