Abstract

Fine-grained visual categorization (FGVC) has attracted extensive attention in recent years. The general pipeline of current FGVC techniques is to 1) locate the discriminative regions; 2) extract features from each region independently; and 3) feed the integrated features to a classifier. In this paper, we re-investigate this pipeline from the perspective of human visual recognition mechanisms. Perceiving discriminative regions is a temporal process carried out by the human visual system (HVS) via the attention-shift mechanism. However, the existing strategy of independent feature extraction and one-pass feeding ignores the inherent semantic relationships among discriminative regions, and thus cannot model the attention-shift process properly. Therefore, in this paper, we propose a novel end-to-end FGVC network named the Attention-Shift based Deep Neural Network (AS-DNN), which locates the discriminative regions automatically and encodes their semantic correlations iteratively. AS-DNN consists of two channels: 1) the global perception channel Cglb and 2) the attention-shift channel Csft, simulating global perception and the attention-shift mechanism, respectively. Experimental results show that AS-DNN achieves state-of-the-art performance, outperforming both weakly- and strongly-supervised CNN-based FGVC algorithms on several widely used fine-grained datasets, and visualizations of the attention regions show that the proposed method can locate discriminative regions robustly under complex backgrounds and postures.
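The abstract describes an iterative attention-shift process: a global representation initializes a state, and each step attends to one discriminative region whose feature then updates the state before the next shift. The following is a minimal numpy sketch of that general idea, not the authors' actual AS-DNN; all dimensions, weights, and the `softmax`/state-update choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: a feature map with H*W spatial locations, D-dim features,
# standing in for the output of the global perception channel (Cglb).
H, W, D = 7, 7, 16
feat = rng.standard_normal((H * W, D))

# The global context initializes the recurrent state (an assumption; the paper's
# actual initialization is not specified in the abstract).
state = feat.mean(axis=0)
W_h = rng.standard_normal((D, D)) * 0.1  # illustrative recurrent weights
T = 3                                    # number of attention shifts

regions = []
for t in range(T):
    scores = feat @ state                # score each location against the state
    alpha = softmax(scores)              # attention distribution over locations
    regions.append(int(alpha.argmax()))  # the currently attended region
    glimpse = alpha @ feat               # attention-weighted region feature
    # "Shift": fold the glimpse into the state so the next attention step is
    # conditioned on what was attended so far (encoding semantic correlations).
    state = np.tanh(state @ W_h + glimpse)

print(regions)
```

In a full network the attended region features would be aggregated (together with the global channel's output) and fed to the classifier; here the loop only illustrates how each shift is conditioned on the regions attended before it.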

