Attention Bilinear Pooling for Fine-Grained Classification

Wang Wang, Zhang Zhang

doi:10.3390/sym11081033

Abstract

Fine-grained image classification is challenging because of its large intra-class variance and low inter-class variance. Bilinear pooling based models have been shown to be effective at fine-grained classification, but most previous approaches neglect the fact that distinctive features and distinguishing regions usually play an important role in solving the fine-grained problem. In this paper, we propose a novel convolutional neural network framework, attention bilinear pooling, for fine-grained classification. The framework learns distinctive feature information through channel or spatial attention, which allows the network to better focus on where the key targets are in the image. We embed spatial attention and channel attention in the underlying network architecture to better represent image features. To further explore the differences between channel and spatial attention, we propose four alternative frameworks: channel attention bilinear pooling (CAB), spatial attention bilinear pooling (SAB), channel spatial attention bilinear pooling (CSAB), and spatial channel attention bilinear pooling (SCAB). Experiments on several datasets show that the proposed method performs well compared with other methods based on bilinear pooling.

Highlights

As an important branch of artificial intelligence, computer vision deals with how computers can be made to gain a high-level understanding from digital images or videos, so as to complete object recognition [1,2,3], detection [4,5], classification [6,7], and other vision-related tasks.
At the same time, we explored which channel of the model is more useful for classification. We explored channel attention, spatial attention, and different combinations of channel and spatial attention with bilinear pooling to study the difference between channel and spatial attention in the classification results.
VGG-16 is often used as the primary model for fine-grained image classification because of its powerful generalization ability, so we focused on conv5_3 in VGG-16 and applied the channel attention module, the spatial attention module, and the double attention module to it, respectively.
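
The pipeline named in the highlights above (an attention module applied to a conv5_3-style feature map, followed by bilinear pooling) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the channel gate is assumed to be squeeze-and-excitation style, the spatial gate is assumed to be a sigmoid over the per-location channel mean, and CSAB/SCAB are assumed to mean "channel then spatial" and "spatial then channel". The function names, the weights `w1` and `w2`, and the tiny feature-map size are all hypothetical. The bilinear pooling step (outer product pooled over locations, signed square root, L2 normalization) follows the usual bilinear CNN recipe.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Reweight channels of feat (C, H, W) with an SE-style gate (assumed form)."""
    squeezed = feat.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck + ReLU -> (C // r,)
    gate = sigmoid(w2 @ hidden)                # per-channel weight in (0, 1)
    return feat * gate[:, None, None]


def spatial_attention(feat):
    """Reweight locations of feat (C, H, W); the gate form is an assumption."""
    gate = sigmoid(feat.mean(axis=0))          # per-location weight -> (H, W)
    return feat * gate[None, :, :]


def bilinear_pool(feat):
    """Outer-product pooling over locations, signed sqrt, then L2 normalization."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)
    b = (x @ x.T) / (h * w)                    # (C, C) second-order statistics
    b = b.reshape(-1)
    b = np.sign(b) * np.sqrt(np.abs(b))        # signed square root
    return b / (np.linalg.norm(b) + 1e-12)     # unit-length descriptor


def attention_bilinear_pool(feat, w1, w2, variant="CAB"):
    """Apply the attention variant (ordering is assumed), then bilinear pooling."""
    if variant == "CAB":
        feat = channel_attention(feat, w1, w2)
    elif variant == "SAB":
        feat = spatial_attention(feat)
    elif variant == "CSAB":                    # assumed: channel first, then spatial
        feat = spatial_attention(channel_attention(feat, w1, w2))
    elif variant == "SCAB":                    # assumed: spatial first, then channel
        feat = channel_attention(spatial_attention(feat), w1, w2)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return bilinear_pool(feat)


# Toy stand-in for a conv5_3 feature map (the real one has 512 channels).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))               # hypothetical bottleneck weights
w2 = rng.standard_normal((8, 2))
descriptor = attention_bilinear_pool(feat, w1, w2, variant="CSAB")
print(descriptor.shape)
```

In a full model the descriptor would feed a linear classifier, and the attention weights would be learned jointly with the backbone rather than drawn at random as in this toy example.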

Summary

Introduction

As an important branch of artificial intelligence, computer vision deals with how computers can be made to gain a high-level understanding from digital images or videos, so as to complete object recognition [1,2,3], detection [4,5], classification [6,7], and other vision-related tasks. In coarse-grained classification, the categories differ greatly from each other, there is no obvious subordinate relationship between them, and they are easy to distinguish. In fine-grained classification, by contrast, the gap between image classes is small, and the categories are generally different sub-categories of the same parent class. Fine-grained image classification is therefore more difficult than coarse-grained classification, for the following reasons. High intra-class variance arises from uncertain factors such as pose, illumination, occlusion, and background interference.

Methods

Findings

Discussion

Conclusion