TokenCompose: Text-to-Image Diffusion with Token-level Supervision

Zirui Wang¹,³, Zhizhou Sha²,³, Zheng Ding³, Yilin Wang²,³, Zhuowen Tu³

¹Princeton University, ²Tsinghua University, ³University of California, San Diego

Project done while Zirui Wang, Zhizhou Sha and Yilin Wang interned at UC San Diego.

CVPR 2024

A Stable Diffusion model finetuned with token-wise consistency terms for enhanced multi-category instance composition and photorealism.

Abstract

We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, with no explicit constraint on the consistency between the text prompts and the image contents, which leads to unsatisfactory results when composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion with our approach, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images.

Given a user-specified text prompt consisting of object compositions that are unlikely to appear simultaneously in a natural scene, our proposed TokenCompose method attains significant performance enhancement over the baseline Latent Diffusion Model (e.g., Stable Diffusion) by being able to generate multiple categories of instances from the prompt more accurately.

Performance of our model in comparison to baselines

We evaluate the performance based on multi-category instance composition (i.e., Object Accuracy (OA) from VISOR Benchmark and MG2-5 from our MultiGen Benchmark), photorealism (i.e., FID from COCO and Flickr30K Entities validation splits), and inference efficiency. All comparisons are based on Stable Diffusion 1.4.

An overview of the training pipeline

Given a training prompt that faithfully describes an image, we use a POS tagger and Grounded SAM to extract a binary segmentation map of the image for each noun token in the prompt. We then jointly optimize the denoising U-Net of the diffusion model with both its original denoising objective and our token-wise objectives.
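The noun-selection step of this pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `noun_token_indices` and the hand-written tagger output are hypothetical, and Penn Treebank tags are assumed because the page does not say which tag set the POS tagger uses.

```python
# Penn Treebank noun tags (assumed tag set; a real pipeline would take
# these tags from an actual POS tagger rather than a hand-written list).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_token_indices(tagged_prompt):
    """Return (word_index, word) for every noun in a POS-tagged prompt."""
    return [
        (i, word)
        for i, (word, tag) in enumerate(tagged_prompt)
        if tag in NOUN_TAGS
    ]


# Hypothetical tagger output for "a cat and a backpack on the bench".
tagged = [
    ("a", "DT"), ("cat", "NN"), ("and", "CC"), ("a", "DT"),
    ("backpack", "NN"), ("on", "IN"), ("the", "DT"), ("bench", "NN"),
]

nouns = noun_token_indices(tagged)
# Each selected noun would then be handed to a segmentation model
# (Grounded SAM in the text above) to obtain one binary mask per noun.
print(nouns)
```

Note that the indices here are word positions. A text encoder may split a word into several sub-word tokens, so a real pipeline would need an extra mapping from words to encoder token positions.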

Loss Illustration

We present a short video to illustrate how Token Loss and Pixel Loss work together to improve the performance of Stable Diffusion 1.4.
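This page does not give formulas for the two losses, so the following is only a sketch under stated assumptions. It assumes the token loss penalizes the fraction of a token's cross-attention mass that falls outside its segmentation mask, and that the pixel loss is a binary cross-entropy between attention values (assumed to lie in [0, 1]) and the mask. The weights `lam` and `gamma`, the averaging over tokens, and all function names are hypothetical. Plain Python lists are used to keep the sketch dependency-free; a training implementation would operate on tensors.

```python
import math


def token_loss(attn, mask):
    """Squared fraction of attention mass lying outside the binary mask."""
    total = sum(sum(row) for row in attn)  # assumed to be > 0
    inside = sum(a * m for ra, rm in zip(attn, mask) for a, m in zip(ra, rm))
    return (1.0 - inside / total) ** 2


def pixel_loss(attn, mask, eps=1e-7):
    """Mean binary cross-entropy between attention values and the mask."""
    terms = []
    for ra, rm in zip(attn, mask):
        for a, m in zip(ra, rm):
            a = min(max(a, eps), 1.0 - eps)  # clamp to avoid log(0)
            terms.append(-(m * math.log(a) + (1 - m) * math.log(1.0 - a)))
    return sum(terms) / len(terms)


def total_loss(denoise_loss, attn_maps, masks, lam=1.0, gamma=1.0):
    """Denoising loss plus weighted grounding losses averaged over tokens."""
    grounding = 0.0
    for attn, mask in zip(attn_maps, masks):
        grounding += lam * token_loss(attn, mask) + gamma * pixel_loss(attn, mask)
    return denoise_loss + grounding / len(attn_maps)


# Toy 2x2 example: the object occupies the top-left pixel only.
mask = [[1, 0], [0, 0]]
focused = [[0.90, 0.05], [0.03, 0.02]]  # attention concentrated on the object
diffuse = [[0.25, 0.25], [0.25, 0.25]]  # attention spread uniformly

print(token_loss(focused, mask), token_loss(diffuse, mask))
print(pixel_loss(focused, mask), pixel_loss(diffuse, mask))
```

In this toy case both losses are lower for the focused map than for the diffuse one, which is the behavior a grounding objective needs: attention for a noun token is pushed toward the region its mask covers.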

Cross Attention Map Comparison

We present several comparisons of attention maps between SD 1.4 and our model. While Stable Diffusion 1.4 struggles to distinguish objects in its cross-attention map, our model excels in effectively grounding the objects.

Tokens visualized (in order of appearance): cake, keyboard, cat, backpack, apple, orange, elephant, suitcase, apple, bench, helicopter.

Visualization by Timestep

We provide visualizations of the cross-attention map by timestep from our Stable Diffusion 1.4 model in comparison to a pretrained Stable Diffusion 1.4 model. Our finetuned model's cross-attention map exhibits significantly stronger grounding capabilities.

Visualization by Attention Heads

We provide visualizations of the cross-attention map by attention head from our Stable Diffusion 1.4 model in comparison to a pretrained Stable Diffusion 1.4 model. Although constrained by grounding objectives, the attention heads of our model still differ from one another in their activations, much as they do in the pretrained model.

Visualization by Attention Layers

We provide visualizations of the cross-attention map by different layers with grounding objectives from our Stable Diffusion 1.4 model in comparison to a pretrained Stable Diffusion 1.4 model. Our finetuned model exhibits consistent activation regions across cross-attention layers with different resolutions.
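One way to check such cross-layer consistency numerically is sketched below. This is an assumption about how one might compare the maps, not a procedure described on this page: maps from different layers are upsampled to a common resolution with nearest-neighbor interpolation, thresholded at a fraction of their peak, and compared by intersection over union. The function names, the 0.5 threshold, and the toy 2x2 and 4x4 maps are all hypothetical.

```python
def upsample_nearest(attn, size):
    """Nearest-neighbor upsample of a square 2D map to size x size."""
    n = len(attn)
    return [
        [attn[r * n // size][c * n // size] for c in range(size)]
        for r in range(size)
    ]


def active_region(attn, frac=0.5):
    """Binary map of pixels whose activation is at least frac * peak."""
    peak = max(max(row) for row in attn)
    return [[1 if v >= frac * peak else 0 for v in row] for row in attn]


def iou(a, b):
    """Intersection over union of two binary maps of equal size."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0


# Toy maps for one token from a low-resolution and a high-resolution layer.
low_res = [[0.9, 0.1],
           [0.1, 0.0]]
high_res = [[0.8, 0.7, 0.1, 0.0],
            [0.9, 0.8, 0.0, 0.1],
            [0.1, 0.0, 0.0, 0.0],
            [0.0, 0.1, 0.0, 0.0]]

region_low = active_region(upsample_nearest(low_res, 4))
region_high = active_region(high_res)
overlap = iou(region_low, region_high)
print(overlap)
```

An IoU close to 1 means the two layers activate on the same region for that token; values near 0 mean the layers disagree about where the object is.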

Qualitative comparison between baselines and our model

We demonstrate the effectiveness of our training framework in multi-category instance composition compared with a frozen Stable Diffusion model, Composable Diffusion, Structured Diffusion, Layout Guidance Diffusion, and Attend-and-Excite. The first three columns show compositions of two categories that are difficult for a pretrained Stable Diffusion model to generate, either because the categories rarely co-occur or because their instances differ greatly in size in the real world. The last three columns show compositions of three categories, where composing them requires an understanding of the visual representation of each text token.

Citation

@InProceedings{Wang2024TokenCompose,
    author    = {Wang, Zirui and Sha, Zhizhou and Ding, Zheng and Wang, Yilin and Tu, Zhuowen},
    title     = {TokenCompose: Text-to-Image Diffusion with Token-level Supervision},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {8553--8564}
}