Abstract

The task of fine-grained visual classification (FGVC) is to distinguish objects belonging to subordinate categories of a common basic-level category. Because fine-grained images exhibit small inter-class variance and large intra-class variance, FGVC is considered an extremely difficult task. Most existing approaches adopt CNN-based networks as feature extractors; the discriminative regions they extract tend to cover most of the object and thus fail to locate the parts that are truly important. Recently, the vision transformer (ViT) has demonstrated its power on a wide range of image tasks: its attention mechanism captures global contextual information and establishes long-range dependencies over the target, yielding more powerful features. Nevertheless, ViT still focuses more on global coarse-grained information than on local fine-grained information, which may lead to undesirable performance in fine-grained image classification. To this end, we redesign the ViT structure into an attention aggregating transformer (AA-Trans) that better captures subtle differences among images. In detail, we propose a core attention aggregator (CAA), which enables better information sharing between transformer layers. In addition, we propose an information entropy selector (IES) that guides the network to precisely acquire the discriminative parts of an image. Extensive experiments show that the proposed model achieves new state-of-the-art performance on several mainstream datasets.
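
For illustration only, the following is a minimal sketch of what cross-layer attention aggregation and entropy-based patch selection might look like, assuming standard head-averaged ViT attention maps. The actual CAA and IES formulations are defined in the paper body; all function names, fusion rules, and parameters below are hypothetical.

```python
import numpy as np

def aggregate_attention(attn_per_layer):
    """Hypothetical cross-layer attention aggregation.

    attn_per_layer: list of (num_patches,) arrays, each giving the CLS-token
    attention over image patches at one transformer layer (head-averaged).
    Returns a single fused attention map over patches.
    """
    fused = np.ones_like(attn_per_layer[0])
    for attn in attn_per_layer:
        # Multiplicative fusion keeps only patches that stay salient across
        # layers; the paper's CAA may combine layers differently.
        fused *= attn + 1e-8
    return fused / fused.sum()

def entropy_select(patch_probs, k):
    """Hypothetical entropy-based selection of discriminative patches.

    patch_probs: (num_patches, num_patches) row-stochastic attention of each
    patch over all patches; a low row entropy means a sharply focused patch.
    Returns indices of the k most focused (lowest-entropy) patches.
    """
    ent = -(patch_probs * np.log(patch_probs + 1e-12)).sum(axis=1)
    return np.argsort(ent)[:k]

# Toy usage with random attention maps (12 layers, 196 patches).
rng = np.random.default_rng(0)
layers = [rng.dirichlet(np.ones(196)) for _ in range(12)]
fused = aggregate_attention(layers)
patch_attn = rng.dirichlet(np.ones(196), size=196)
top_idx = entropy_select(patch_attn, k=12)
```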
