Abstract

Fine-grained visual categorization is challenging owing to high intra-class and low inter-class variance. Classical approaches rely on pre-trained models or extensive fine-grained annotations. In this paper, we observe that spatial and frequency information provide distinct views of an image, and we propose a new spatial–frequency feature fusion (SFFF) perspective to address this problem. Specifically, we design a heterogeneous feature extraction loss function, construct a global–local fusion SFFF network, and propose an importance–sparsity selection strategy. For feature extraction, we focus on a frequency-domain feature learning network, extract fine-grained features, and achieve feature complementarity. For feature selection, we propose importance ranking and sparse regularization to constrain the spatial–frequency features. For feature fusion, we design a spatial–frequency loss and an inter-layer switching strategy to achieve local–global collaboration. Comparative experiments were performed on popular fine-grained benchmarks (CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft) and the classic CIFAR-100 dataset. The effectiveness and strong performance of SFFF are confirmed by comparisons with more than 40 state-of-the-art fine-grained categorization methods. Ablation studies and visualizations are provided to facilitate understanding of our approach.
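As a rough illustration of the spatial–frequency fusion idea described above (not the authors' SFFF network, whose architecture, loss functions, and selection strategy are detailed in the paper), the following PyTorch sketch derives a frequency-domain view of a backbone feature map with a 2-D FFT and fuses it with the original spatial view. The module name, the 1×1 projection layers, and the concatenation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpatialFrequencyFusionSketch(nn.Module):
    """Toy sketch: build a frequency-domain view of a spatial feature map
    via a 2-D FFT, project it back to the original channel width, and fuse
    the spatial and frequency views with a 1x1 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        # Mix the real/imaginary parts of the spectrum back to `channels` maps.
        self.freq_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Fuse the concatenated spatial + frequency views.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) spatial feature map from any backbone.
        spec = torch.fft.fft2(x, norm="ortho")            # complex spectrum
        freq = torch.cat([spec.real, spec.imag], dim=1)   # (B, 2C, H, W)
        freq = self.freq_proj(freq)                       # frequency view, (B, C, H, W)
        return self.fuse(torch.cat([x, freq], dim=1))     # fused view, (B, C, H, W)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 28, 28)                  # dummy backbone features
    fused = SpatialFrequencyFusionSketch(64)(feats)
    print(fused.shape)                                   # torch.Size([2, 64, 28, 28])
```

In this sketch the frequency branch supplies a complementary, globally distributed view of the same features, which is the intuition behind combining spatial and frequency information for fine-grained cues.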
