SegLD: Achieving universal, zero-shot and open-vocabulary segmentation through multimodal fusion via latent diffusion processes

Hongtao Zheng, Yifei Ding, Zilong Wang, Xinyan Huang

doi:10.1016/j.inffus.2024.102509

Abstract

Open-vocabulary learning identifies categories annotated during training (seen categories) and generalizes to categories not annotated in the training set (unseen categories), so in theory it could extend segmentation systems to more universal applications. However, current open-vocabulary segmentation frameworks are mostly suited to specific tasks or must be retrained for each task, and they significantly underperform fully supervised frameworks on seen categories. We therefore introduce SegLD, a universal open-vocabulary segmentation framework based on the latent diffusion process, which requires only a single training session on a panoptic dataset to perform inference across all open-vocabulary segmentation tasks and reaches SOTA segmentation performance for both seen and unseen categories in every task. SegLD comprises two stages. In the first stage, we deploy two parallel latent diffusion processes to deeply fuse the text (image captions or category labels) and image information, and then aggregate the multi-scale features output by both latent diffusion processes scale by scale. In the second stage, we introduce text queries, text list queries, and task queries, and compute contrastive losses between them to facilitate learning of inter-category and inter-task differences. The text queries are then fed into a Transformer Decoder to obtain category-agnostic segmentation masks. Finally, we define classification loss functions according to the type of text input used during training (image captions or category labels) to help assign a category label from the open vocabulary to each predicted binary mask.
Experimental results show that, with just a single training session, SegLD significantly outperforms contemporary SOTA fully supervised and open-vocabulary segmentation frameworks across almost all evaluation metrics for both seen and unseen categories on the ADE20K, Cityscapes, and COCO datasets. This highlights SegLD’s capability as a universal segmentation framework, with the potential to replace other segmentation frameworks and adapt to various segmentation domains. The project link for SegLD is https://zht-segld.github.io/.
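
The abstract mentions contrastive losses computed between query sets but does not give their exact form. As a rough illustration only, the sketch below assumes a one-directional InfoNCE-style loss over paired embeddings; the function names, the temperature value, and the pairing of the i-th entry of one set with the i-th entry of the other are all assumptions made for this example, not details taken from SegLD.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(queries_a, queries_b, temperature=0.07):
    """Mean InfoNCE-style loss (hypothetical form, not the paper's definition).

    queries_a[i] and queries_b[i] are treated as the positive pair;
    every other entry of queries_b acts as a negative for queries_a[i].
    """
    a = [l2_normalize(v) for v in queries_a]
    b = [l2_normalize(v) for v in queries_b]
    total = 0.0
    for i, qa in enumerate(a):
        # Cosine similarities scaled by the temperature.
        logits = [dot(qa, qb) / temperature for qb in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the positive pair.
        total += log_denominator - logits[i]
    return total / len(a)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(aligned, aligned) < contrastive_loss(aligned, swapped))
```

Under these assumptions the loss is close to zero when each query matches its own partner and grows large when the partners are swapped, which is the behaviour a contrastive objective relies on to separate categories or tasks.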
