Abstract

Generating initial seeds is an important step in weakly supervised semantic segmentation (WSSS), and our approach concentrates on generating and refining these seeds. Initial seeds produced by convolutional neural networks (CNNs) focus only on the most discriminative regions and lack global information about the target. Vision Transformer (ViT)-based approaches can capture long-range feature dependencies thanks to the self-attention mechanism, yet we find that they suffer from distractor-object leakage and background leakage. Based on these observations, we propose PCSformer, which improves the model's ability to extract features through a Pair-wise Cross-scale (PC) strategy and addresses distractor-object leakage by mining Sub-Prototypes (SP) to extract potential target features. In addition, the proposed Conflict Self-Elimination (CSE) module further alleviates background leakage. We validate our approach on the widely adopted Pascal VOC 2012 and MS COCO 2014 benchmarks, and extensive experiments demonstrate its superior performance. Our method also proves competitive for WSSS on medical images and in challenging scenarios involving deformable objects and cluttered scenes. Finally, we extend PCSformer to weakly supervised object localization, further highlighting its scalability and versatility. The code is available at https://github.com/ChunmengLiu1/PCSformer.
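The long-range dependency modeling the abstract attributes to ViTs comes from standard scaled dot-product self-attention, in which every token attends to every other token. The sketch below illustrates that generic mechanism only; it is not PCSformer's architecture, and the shapes, seed, and projection matrices are hypothetical.

```python
# Minimal sketch of standard scaled dot-product self-attention -- the generic
# mechanism behind ViTs' long-range feature dependencies. Illustrative only;
# not the PCSformer model. All dimensions/weights here are hypothetical.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (n_tokens, d_model); w_q/w_k/w_v: (d_model, d_head) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token scores against every token -> global (long-range) interactions.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over all positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # each output token mixes information from all tokens

rng = np.random.default_rng(0)
n, d = 16, 8  # hypothetical token count and embedding dimension
x = rng.standard_normal((n, d))
out = self_attention(x, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)  # (16, 8)
```

Because the attention weights couple all token pairs in a single layer, each output embedding can draw on the entire image, unlike a CNN whose receptive field grows only gradually with depth.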
