TokenCompose: Text-to-Image Diffusion with Token-level Supervision
CVPR 2024
A Stable Diffusion model finetuned with token-wise consistency terms for enhanced multi-category instance composition and photorealism.
Abstract
We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process of the Latent Diffusion Model takes text prompts only as conditions, with no explicit constraint enforcing consistency between the text prompts and the image contents, which leads to unsatisfactory results when composing multiple object categories. Our proposed TokenCompose aims to improve multi-category instance composition by introducing token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling. By finetuning Stable Diffusion with our approach, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism in its generated images.
Given a user-specified text prompt describing object compositions that are unlikely to appear together in a natural scene, our proposed TokenCompose method attains significant gains over the baseline Latent Diffusion Model (e.g., Stable Diffusion) by generating the multiple categories of instances in the prompt more accurately.
Performance of our model in comparison to baselines
We evaluate performance on multi-category instance composition (i.e., Object Accuracy (OA) from the VISOR benchmark and the MG2 to MG5 success rates from our MultiGen benchmark), photorealism (i.e., FID on the COCO and Flickr30K Entities validation splits), and inference efficiency. All comparisons are based on Stable Diffusion 1.4.
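As a rough illustration of how a MultiGen-style success rate could be computed (the helper functions below and the exact protocol are assumptions for illustration, not the benchmark's official implementation), a generated image counts toward MG-k only if all k prompted categories are found by an off-the-shelf object detector:

def multigen_success(detected_labels, target_categories):
    # An image is a success only if every prompted category is among the
    # detector's outputs (the detector itself is not shown here).
    detected = {label.lower() for label in detected_labels}
    return all(cat.lower() in detected for cat in target_categories)

def mg_rate(per_image_detections, per_image_targets):
    # Fraction of generated images in which all prompted categories appear.
    hits = sum(multigen_success(d, t)
               for d, t in zip(per_image_detections, per_image_targets))
    return hits / max(len(per_image_targets), 1)

# Example: a 3-category prompt where the detector finds only two categories.
print(mg_rate([["cat", "backpack"]], [["cat", "backpack", "apple"]]))  # 0.0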
An overview of the training pipeline
Given a training prompt that faithfully describes an image, we use a POS tagger to extract the noun tokens from the prompt and Grounded SAM to obtain the binary segmentation map of the image for each of those tokens. We then jointly optimize the denoising U-Net of the diffusion model with both its original denoising objective and our token-wise grounding objective.
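As a minimal sketch of the prompt-parsing step, assuming spaCy as the POS tagger (the pipeline only requires a POS tagger; the choice of spaCy here is an assumption), noun tokens can be extracted as follows. Pairing each noun with a binary mask from Grounded SAM is omitted:

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_noun_tokens(prompt: str):
    # Return the noun tokens of a training prompt; each noun is later paired
    # with a binary segmentation map produced by Grounded SAM.
    doc = nlp(prompt)
    return [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]

print(extract_noun_tokens("a cat sitting next to a backpack and an apple"))
# ['cat', 'backpack', 'apple']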
Loss Illustration
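The sketch below illustrates one plausible form of the token-wise grounding terms: an attention-mass term that pulls a noun token's cross-attention inside its segmentation mask, plus a per-pixel term, combined with the original denoising loss. The exact formulation and the weights lambda_token / lambda_pixel are illustrative assumptions, not the paper's verbatim implementation:

import torch
import torch.nn.functional as F

def token_loss(attn_map, mask):
    # Encourage the attention mass of one noun token (attn_map, shape HxW,
    # non-negative) to fall inside its binary segmentation mask (same shape).
    inside = (attn_map * mask).sum()
    return 1.0 - inside / (attn_map.sum() + 1e-8)

def pixel_loss(attn_map, mask):
    # Per-pixel binary cross-entropy between the normalized attention map and
    # the mask, pushing activations toward the object region.
    probs = (attn_map / (attn_map.max() + 1e-8)).clamp(0.0, 1.0)
    return F.binary_cross_entropy(probs, mask.float())

def total_loss(denoise_loss, attn_maps, masks, lambda_token=1.0, lambda_pixel=0.5):
    # Joint objective: original denoising loss plus the grounding terms averaged
    # over all noun tokens that have a segmentation mask. Weights are placeholders.
    grounding = torch.stack([
        lambda_token * token_loss(a, m) + lambda_pixel * pixel_loss(a, m)
        for a, m in zip(attn_maps, masks)
    ]).mean()
    return denoise_loss + grounding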
Cross Attention Map Comparison
We present several comparisons of cross-attention maps between Stable Diffusion 1.4 and our model. While Stable Diffusion 1.4 struggles to distinguish objects in its cross-attention maps, our model grounds each object effectively.
Tokens visualized: cake, keyboard, cat, backpack, apple, orange, elephant, suitcase, apple, bench, helicopter.
Visualization by Timestep
We provide visualizations of the cross-attention map at each denoising timestep from our finetuned Stable Diffusion 1.4 model in comparison to the pretrained Stable Diffusion 1.4 model. The cross-attention maps of our finetuned model exhibit significantly stronger grounding throughout the timesteps.
Visualization by Attention Heads
We provide visualizations of the cross-attention map from different attention heads of our finetuned Stable Diffusion 1.4 model in comparison to the pretrained Stable Diffusion 1.4 model. Although constrained by the grounding objectives, the attention heads of our model still exhibit variation in activations across heads similar to that of the pretrained model.
Visualization by Attention Layers
We provide visualizations of the cross-attention map from the different cross-attention layers trained with grounding objectives in our finetuned Stable Diffusion 1.4 model, in comparison to the pretrained Stable Diffusion 1.4 model. Our finetuned model exhibits consistent activation regions across cross-attention layers of different resolutions.
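A minimal sketch of how such per-head, per-layer maps can be aggregated for these visualizations, assuming the cross-attention probabilities have already been captured (e.g., via forward hooks; the capture code and the tensor layout below are assumptions):

import torch
import torch.nn.functional as F

def aggregate_attention(layer_maps, token_idx, out_size=64):
    # Average one text token's cross-attention over heads, upsample each
    # layer's map to a common resolution, then average over layers.
    # layer_maps: list of tensors shaped (num_heads, H*W, num_tokens).
    resized = []
    for attn in layer_maps:
        side = int(attn.shape[1] ** 0.5)               # spatial side length
        per_token = attn[:, :, token_idx].mean(dim=0)  # average over heads
        grid = per_token.reshape(1, 1, side, side)
        resized.append(F.interpolate(grid, size=(out_size, out_size),
                                     mode="bilinear", align_corners=False))
    return torch.cat(resized, dim=0).mean(dim=0).squeeze()  # average over layers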
Qualitative comparison between baselines and our model
We demonstrate the effectiveness of our training framework in multi-category instance composition compared with a frozen Stable Diffusion model, Composable Diffusion, Structured Diffusion, Layout Guidance Diffusion, and Attend-and-Excite. The first three columns show compositions of two categories that are difficult for a pretrained Stable Diffusion model to generate (due to rare co-occurrence or significant differences in instance sizes in the real world). The last three columns show compositions of three categories, where composing them requires an understanding of the visual representation of each text token.
Citation
@InProceedings{Wang2024TokenCompose,
author = {Wang, Zirui and Sha, Zhizhou and Ding, Zheng and Wang, Yilin and Tu, Zhuowen},
title = {TokenCompose: Text-to-Image Diffusion with Token-level Supervision},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {8553-8564}
}