Weakly supervised object localization (WSOL) is a long-standing challenge in computer vision: it aims to predict the location of objects in an image using only image-level class labels. Traditional approaches based on convolutional neural networks (CNNs) rely on local, discriminative class activations learned under classification guidance, and their main drawback is that they cannot capture long-range feature dependencies between pixels. Recently, the transformer architecture has been applied to WSOL, but transformers do not capture local features well. To address these problems, we propose HiCT (Hierarchical comprehend of transformer), a simple and effective vision transformer variant. We also propose a discriminative-based attention layer (DAL), which mines local feature information by exploiting the global token attention map. To further improve the coverage of object localization, we introduce the spatial aware digging module (SADM). In addition, we propose a set of complementarity loss calculators for the patch hierarchy (CPH) to improve the sample class aggregation capability of our model. Finally, we conduct experiments on two commonly used datasets, CUB-200-2011 and ILSVRC, to verify the effectiveness of our method.