Abstract

Fine-grained visual classification (FGVC) is a challenging task because it requires learning discriminative features. Most existing methods directly use the final output of the network, which mainly carries global features with high-level semantic information. However, the differences between fine-grained images lie in subtle local regions, which are usually captured by the earlier layers of the network. Moreover, when the texture of the background resembles that of the object, or the background occupies too large a proportion of the image, the prediction is strongly affected. To address these problems, this paper proposes a multi-granularity feature fusion (MGFF) module and a two-stage classification scheme based on the Vision Transformer (ViT). The former represents images comprehensively by fusing features of different granularities, thus avoiding the limitations of single-scale features. The latter leverages the ViT model to separate the object from the background at very small cost, thereby improving prediction accuracy. We conduct comprehensive experiments and achieve the best performance on two fine-grained benchmarks, CUB-200-2011 and NA-Birds.
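As a rough illustration of the fusion idea described above, the following is a minimal PyTorch sketch of fusing ViT patch-token features pooled at several granularities into a single representation. The module name, the choice of granularities, the per-granularity projections, and the final fusion layer are all illustrative assumptions for this sketch, not the authors' MGFF implementation.

```python
# Hypothetical sketch of multi-granularity feature fusion over ViT patch tokens.
# Not the paper's MGFF code; layer choices and granularities are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularityFusion(nn.Module):
    """Pools patch tokens at several grid sizes and fuses them into one feature."""

    def __init__(self, dim: int, granularities=(1, 2, 4)):
        super().__init__()
        self.granularities = granularities
        # One projection per granularity maps the pooled g x g grid to a dim-sized vector.
        self.proj = nn.ModuleList(
            nn.Linear(dim * g * g, dim) for g in granularities
        )
        # Fuse the concatenated per-granularity vectors back to the embedding dimension.
        self.fuse = nn.Linear(dim * len(granularities), dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim), e.g. ViT patch embeddings without the CLS token.
        b, n, d = tokens.shape
        side = int(n ** 0.5)  # assumes a square patch grid (e.g. 14 x 14 for ViT-Base at 224px)
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        parts = []
        for g, proj in zip(self.granularities, self.proj):
            pooled = F.adaptive_avg_pool2d(grid, g).flatten(1)  # (b, d * g * g)
            parts.append(proj(pooled))                          # (b, dim)
        return self.fuse(torch.cat(parts, dim=1))                # (b, dim)


if __name__ == "__main__":
    feats = torch.randn(2, 196, 768)           # two images, 14 x 14 patch tokens, ViT-Base width
    fused = MultiGranularityFusion(768)(feats)
    print(fused.shape)                          # torch.Size([2, 768])
```

The fused vector can then be fed to a classification head; the coarsest granularity (1 x 1) plays the role of a global feature, while the finer grids retain the local cues the abstract argues are needed for fine-grained distinctions.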
