Abstract

Weakly supervised object localization (WSOL), which trains object localization models using only image category annotations, remains a challenging problem. Existing approaches based on convolutional neural networks (CNNs) tend to activate only discriminative object parts and miss the full object extent. Based on our analysis, this is caused by the intrinsic characteristics of CNNs, which have difficulty capturing object semantics over long distances. In this article, we introduce the vision transformer to WSOL, with the aim of capturing long-range semantic dependencies among features by leveraging the transformer's cascaded self-attention mechanism. We propose the token semantic coupled attention map (TS-CAM) method, which first decomposes class-aware semantics and then couples the semantics with attention maps for semantic-aware activation. To capture object semantics at long distances and avoid partial activation, TS-CAM performs spatial embedding by partitioning an image into a set of patch tokens. To incorporate object category information into the patch tokens, TS-CAM reallocates category-related semantics to each patch token. The patch tokens are finally coupled with semantic-agnostic attention maps to perform semantic-aware object localization. By introducing semantic tokens to produce semantic-aware attention maps, we further explore the capability of TS-CAM for multicategory object localization. Experiments show that TS-CAM outperforms its CNN-CAM counterpart by 11.6% and 28.9% on the ILSVRC and CUB-200-2011 datasets, respectively, improving the state of the art by large margins. TS-CAM also demonstrates superiority for multicategory object localization on the Pascal VOC dataset. The code is available at github.com/yuanyao366/ts-cam-extension.
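The coupling step described above can be illustrated with a minimal sketch. It is not the authors' reference implementation; the module names (`TSCAMCouplingSketch`, `sem_head`), the use of a 3x3 convolution to reallocate category-related semantics, and the averaging of class-token attention across layers are assumptions made for illustration, given patch token embeddings and class-token-to-patch attention maps from a vision transformer backbone.

```python
# Minimal sketch of the TS-CAM coupling step (illustrative, not the paper's reference code).
# Assumed inputs: `patch_tokens` are ViT output embeddings for N = h*w image patches,
# `attn_maps` are per-layer attention weights from the class token to the patch tokens.
import torch
import torch.nn as nn

class TSCAMCouplingSketch(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # Hypothetical conv head that reallocates category-related semantics to each
        # patch token, yielding one semantic map per object category.
        self.sem_head = nn.Conv2d(embed_dim, num_classes, kernel_size=3, padding=1)

    def forward(self, patch_tokens, attn_maps, h, w):
        # patch_tokens: (B, N, D) with N = h * w
        # attn_maps:    (B, L, N) class-token-to-patch attention, one row per layer
        B, N, D = patch_tokens.shape
        feat = patch_tokens.transpose(1, 2).reshape(B, D, h, w)   # reshape tokens to a 2-D feature map
        sem_maps = self.sem_head(feat)                            # (B, C, h, w) class-aware semantic maps
        attn = attn_maps.mean(dim=1).reshape(B, 1, h, w)          # (B, 1, h, w) semantic-agnostic attention
        ts_cam = sem_maps * attn                                  # couple semantics with attention
        logits = ts_cam.flatten(2).mean(dim=-1)                   # (B, C) classification scores from averaged maps
        return ts_cam, logits
```

In this sketch, the coupled maps `ts_cam` would be thresholded to produce localization boxes, while `logits` allow training with image-level category labels only.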
