From text to mask: Localizing entities using the attention of text-to-image diffusion models

Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang

doi:10.1016/j.neucom.2024.128437

Abstract

Diffusion models have recently revolutionized the field of text-to-image generation. Their distinctive way of fusing text and image information contributes to their remarkable ability to generate images that closely match the text. From another perspective, these generative models carry clues about the precise correlation between words and pixels. This work proposes a simple but effective method that exploits the attention mechanism in the denoising network of text-to-image diffusion models. Without additional training or inference-time optimization, the semantic grounding of phrases can be obtained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly supervised semantic segmentation setting, where it outperforms prior methods. In addition, the acquired word-pixel correlation generalizes to the learned text embeddings of customized generation methods, requiring only a few modifications. To validate this finding, we introduce a new practical task called “personalized referring image segmentation”, together with a new dataset. Experiments in various situations demonstrate the advantages of our method over strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
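
The abstract does not spell out how attention is turned into a mask, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a single generic cross-attention layer in which pixel features act as queries and text-token features act as keys. The function names, the toy random features, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def cross_attention_maps(pixel_queries, token_keys):
    """Softmax attention of every pixel (row) over the text tokens (columns)."""
    dim = pixel_queries.shape[-1]
    logits = pixel_queries @ token_keys.T / np.sqrt(dim)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def token_mask(attn, token_index, height, width, threshold=0.5):
    """Turn one token's attention column into a binary spatial mask.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    heat = attn[:, token_index].reshape(height, width)
    span = heat.max() - heat.min()
    heat = (heat - heat.min()) / (span + 1e-8)  # min-max normalize to [0, 1]
    return heat >= threshold

# Toy stand-ins for real features: a 4x4 latent grid and 3 text tokens.
rng = np.random.default_rng(0)
height, width, dim, num_tokens = 4, 4, 8, 3
queries = rng.normal(size=(height * width, dim))
keys = rng.normal(size=(num_tokens, dim))

attn = cross_attention_maps(queries, keys)
mask = token_mask(attn, token_index=1, height=height, width=width)
```

In a real text-to-image diffusion model the queries and keys would come from the denoising network's cross-attention layers rather than from random numbers, and maps from several layers and timesteps would typically need to be combined. How the paper does this is not stated in the abstract.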

More From: Neurocomputing
