Abstract
Fine-grained visual categorization (FGVC) has attracted extensive attention in recent years. The general pipeline of current FGVC techniques is to 1) locate the discriminative regions; 2) extract features from each region independently; and 3) feed the integrated features to a classifier. In this paper, we re-investigate this pipeline from the perspective of human visual recognition mechanisms. The human visual system (HVS) perceives discriminative regions as a temporal process via the attention-shift mechanism. However, the existing strategy of independent feature extraction and one-pass feeding ignores the inherent semantic relationships among discriminative regions, and therefore cannot model the attention-shift process properly. We thus propose a novel end-to-end FGVC network, the Attention-Shift based Deep Neural Network (AS-DNN), which locates the discriminative regions automatically and encodes their semantic correlations iteratively. AS-DNN consists of two channels: 1) the global perception channel Cglb and 2) the attention-shift channel Csft, which simulate global perception and the attention-shift mechanism, respectively. Experimental results show that AS-DNN achieves state-of-the-art performance, outperforming both weakly- and strongly-supervised CNN-based FGVC algorithms on several widely used fine-grained datasets, and visualization of the attention regions shows that the proposed method locates discriminative regions robustly under complex backgrounds and postures.
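The abstract does not specify the internals of the two channels, so the following is only a minimal NumPy sketch of the general idea it describes: a global feature initializes a recurrent state, and an attention-shift loop repeatedly re-weights region features conditioned on that state, so each step's attention depends on what was attended before. All weight matrices, dimensions, and the update rule are hypothetical stand-ins, not the paper's actual AS-DNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_shift_forward(global_feat, region_feats, steps=3, hidden_dim=16):
    """Toy forward pass in the spirit of AS-DNN's two channels.

    global_feat:  (D,)   stand-in for the global perception channel's output
    region_feats: (R, D) features of R candidate discriminative regions
    """
    R, D = region_feats.shape
    # Hypothetical projection weights (random stand-ins for learned params).
    W_h = rng.normal(0, 0.1, (hidden_dim, D))
    W_s = rng.normal(0, 0.1, (hidden_dim, hidden_dim))

    h = np.tanh(W_h @ global_feat)           # state initialized globally
    attended = []
    for _ in range(steps):                   # attention-shift iterations
        scores = region_feats @ (W_h.T @ h)  # score regions against state
        alpha = softmax(scores)              # attention over regions
        ctx = alpha @ region_feats           # attended region feature
        h = np.tanh(W_h @ ctx + W_s @ h)     # update state: attention shifts
        attended.append(ctx)
    # fuse global and iteratively attended features for a classifier
    return np.concatenate([global_feat] + attended)

fused = attention_shift_forward(rng.normal(size=8), rng.normal(size=(5, 8)))
print(fused.shape)  # (32,)  i.e. 8 global + 3 steps * 8 attended dims
```

The key contrast with the one-pass pipeline criticized above is the loop: because the state `h` carries information from previously attended regions into the next attention computation, the region features are no longer encoded independently.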