This paper proposes a novel transformer-based framework to generate accurate class-specific object localization maps for weakly supervised semantic segmentation (WSSS). Building on the insight that the attended regions of the single class token in the standard vision transformer yield class-agnostic localization maps, we investigate the transformer's capacity to capture class-specific attention for class-discriminative object localization by learning multiple class tokens. We present the Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with the patch tokens, facilitated by a class-aware training strategy that establishes a one-to-one correspondence between the output class tokens and the ground-truth class labels. We also introduce a Contrastive-Class-Token (CCT) module to enhance the learning of discriminative class tokens, enabling the model to better capture the unique characteristics of each class. The proposed framework thus effectively generates class-discriminative object localization maps from the class-to-patch attention associated with different class tokens. To further refine these localization maps, we exploit the patch-level pairwise affinity derived from the patch-to-patch transformer attention. Moreover, the proposed framework seamlessly complements the Class Activation Mapping (CAM) method, yielding significant improvements in WSSS performance on PASCAL VOC 2012 and MS COCO 2014. These results underline the importance of the class token for WSSS.
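The abstract describes extracting class-specific localization maps from class-to-patch attention and refining them with patch-to-patch affinity. The following is a minimal NumPy sketch of that idea, not the paper's implementation: it assumes a hypothetical attention-matrix layout in which the first `num_classes` rows/columns correspond to the class tokens and the remaining `h*w` to the patch tokens, and the function name and refinement step are illustrative assumptions.

```python
import numpy as np

def class_localization_maps(attn, num_classes, h, w):
    """Sketch: derive per-class localization maps from one attention matrix.

    Assumed (hypothetical) layout: rows/columns 0..num_classes-1 are class
    tokens, the remaining h*w rows/columns are patch tokens.
    """
    # class-to-patch attention: one row of patch scores per class token
    cls2patch = attn[:num_classes, num_classes:]        # shape (C, h*w)
    maps = cls2patch.reshape(num_classes, h, w)         # shape (C, h, w)

    # patch-to-patch attention acts as a pairwise affinity; propagating
    # scores between mutually attending patches refines each map
    affinity = attn[num_classes:, num_classes:]         # shape (h*w, h*w)
    affinity = affinity / affinity.sum(axis=-1, keepdims=True)
    refined = (cls2patch @ affinity.T).reshape(num_classes, h, w)
    return maps, refined

# toy example: 3 class tokens, a 4x4 patch grid
C, h, w = 3, 4, 4
n = C + h * w
rng = np.random.default_rng(0)
attn = rng.random((n, n))
attn = attn / attn.sum(axis=-1, keepdims=True)  # row-normalized, softmax-like
maps, refined = class_localization_maps(attn, C, h, w)
print(maps.shape, refined.shape)  # (3, 4, 4) (3, 4, 4)
```

In practice the attention would come from a trained transformer (typically averaged over heads and layers), and each map would be normalized before thresholding into a pseudo segmentation label.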