Existing fine-grained image recognition methods dig into low-level details to emphasize the subtle discrepancies among sub-categories. A potential limitation of these methods, however, is that they fuse low-level details and high-level semantics directly, neglecting their complementarity in content and their spatial correspondence. To address this limitation, we propose an end-to-end Semantic-guided Information Alignment Network (SIA-Net) that dynamically selects low-level details under the guidance of accurate semantics, so that the selected details are spatially aligned with the high-level semantics and complementary to them in content. Technically, SIA-Net consists of an Accurate Semantic Calibration (ASC) module, which provides accurate semantics, and a Discriminative Feature Alignment (DFA) module, which aggregates low-level details and high-level semantics using the accurate semantics generated by ASC. ASC learns the pixel-level feature shifting caused by convolutional operations and uses it to correct incorrectly highlighted semantics by shifting discriminative semantic or background features back into place. Given the calibrated semantic features, DFA digs out the complementary details and simultaneously aligns them spatially under the guidance of the accurate semantics, producing the reassembled features. Finally, the reassembled features serve as discriminative cues for more accurate localization of discriminative regions. Extensive experiments verify that, under the same settings, the proposed method outperforms the most competitive approaches on the CUB-birds, Stanford-Cars, and FGVC Aircraft datasets.
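The abstract does not give implementation details, so the following is only a minimal PyTorch-style sketch of how the ASC and DFA ideas described above could be realized: ASC predicts a pixel-level offset field and warps the high-level semantic map to undo convolution-induced misalignment, and DFA uses the calibrated semantics to gate complementary low-level details before fusing the two streams. All module names, shapes, and operations here are assumptions for illustration, not the authors' code; in particular, both streams are assumed to have already been projected to the same channel dimension.

```python
# Hypothetical sketch (not the authors' implementation) of the ASC/DFA ideas,
# assuming low- and high-level features are (B, C, H, W) tensors with equal C.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASC(nn.Module):
    """Accurate Semantic Calibration: predict a pixel-level shift field and
    warp the high-level semantics to correct convolution-induced offsets."""
    def __init__(self, channels):
        super().__init__()
        # 2-channel offset field predicted from the concatenated feature streams.
        self.offset = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, low, high):
        high_up = F.interpolate(high, size=low.shape[-2:],
                                mode="bilinear", align_corners=False)
        flow = self.offset(torch.cat([low, high_up], dim=1))      # (B, 2, H, W)
        b, _, h, w = low.shape
        # Identity sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=low.device),
                                torch.linspace(-1, 1, w, device=low.device),
                                indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Shift the grid by the predicted flow and resample the semantics.
        shift = flow.permute(0, 2, 3, 1) / torch.tensor([w, h],
                                                        dtype=low.dtype,
                                                        device=low.device)
        return F.grid_sample(high_up, grid + shift, align_corners=False)

class DFA(nn.Module):
    """Discriminative Feature Alignment: use the calibrated semantics to
    select complementary low-level details, then fuse the two streams."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, low, high_calibrated):
        selected = low * self.gate(high_calibrated)   # semantic-guided selection
        return self.fuse(torch.cat([selected, high_calibrated], dim=1))
```

In this reading, the fused output of DFA plays the role of the reassembled features, which a downstream head could then use to localize discriminative regions; the exact calibration and aggregation operators used by SIA-Net are described in the full paper.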