Coarse-to-Fine Description for Fine-Grained Visual Categorization

Hantao Yao,Jintao Li,Shiliang Zhang,Qi Tian,Yongdong Zhang

doi:10.1109/tip.2016.2599102

Abstract

Recent years have witnessed the significant advance in fine-grained visual categorization, which targets to classify the objects belonging to the same species. To capture enough subtle visual differences and build discriminative visual description, most of the existing methods heavily rely on the artificial part annotations, which are expensive to collect in real applications. Motivated to conquer this issue, this paper proposes a multi-level coarse-to-fine object description. This novel description only requires the original image as input, but could automatically generate visual descriptions discriminative enough for fine-grained visual categorization. This description is extracted from five sources representing coarse-to-fine visual clues: 1) original image is used as the source of global visual clue; 2) object bounding boxes are generated using convolutional neural network (CNN); 3) with the generated bounding box, foreground is segmented using the proposed k nearest neighbour -based co-segmentation algorithm; and 4) two types of part segmentations are generated by dividing the foreground with an unsupervised part learning strategy. The final description is generated by feeding these sources into CNN models and concatenating their outputs. Experiments on two public benchmark data sets show the impressive performance of this coarse-to-fine description, i.e. , classification accuracy achieves 82.5% on CUB-200-2011, and 86.9% on fine-grained visual categorization-Aircraft, respectively, which outperform many recent works.

Full Text