Abstract

Learning delicate feature representations of object parts plays a critical role in fine-grained visual classification. However, deep convolutional neural networks trained for general visual classification tend to focus on coarse-grained information while ignoring the fine-grained cues that are essential for learning discriminative representations. In this work, we explore the merit of multi-modal data for introducing semantic knowledge, together with sequential analysis techniques, for learning hierarchical feature representations that yield discriminative fine-grained features. To this end, we propose a novel approach, termed Channel Cusum Attention ResNet (CCA-ResNet), for multi-modal joint learning of fine-grained representations. Specifically, we use feature-level multi-modal alignment to connect image and text classification models for joint multi-modal training. Through joint training, image classification models trained with semantic-level labels learn to focus on the most discriminative parts, which enhances the cognitive ability of the model. We then propose a Channel Cusum Attention (CCA) mechanism that equips feature maps with hierarchical properties through unsupervised reconstruction of local and global features. The benefits brought by CCA are twofold: a) fine-grained features from early layers are preserved during the forward propagation of deep networks; b) the hierarchical properties facilitate multi-modal feature alignment. Extensive experiments verify that our proposed model achieves state-of-the-art performance on a series of fine-grained visual classification benchmarks.
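
To make the channel attention idea concrete, the following is a minimal, hypothetical sketch, assuming "cusum" refers to a cumulative sum taken over per-channel descriptors; the module names, the squeeze-and-excitation-style bottleneck, and the `reduction` parameter are illustrative assumptions and not the paper's actual CCA-ResNet implementation.

```python
# Illustrative sketch only: channel attention weighted by a cumulative sum
# (cusum) over per-channel descriptors. The design details are assumptions
# based on the abstract, not the authors' published code.
import torch
import torch.nn as nn


class ChannelCusumAttention(nn.Module):
    """Hypothetical channel-cusum attention block."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Bottleneck that maps channel descriptors to per-channel scores
        # (squeeze-and-excitation style; assumed for illustration).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W). Global average pooling gives one descriptor per channel.
        desc = x.mean(dim=(2, 3))                                   # (B, C)
        scores = torch.softmax(self.fc(desc), dim=1)                # (B, C)
        # Cumulative sum over channels: each channel's weight accumulates
        # contributions from preceding channels, giving a hierarchical ordering.
        cusum = torch.cumsum(scores, dim=1)                         # (B, C)
        # Re-weight the feature map channel-wise.
        return x * cusum.unsqueeze(-1).unsqueeze(-1)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    attn = ChannelCusumAttention(64)
    print(attn(feat).shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch, the cumulative sum is what gives later channels access to attention mass accumulated from earlier ones; how the paper actually orders channels and performs the unsupervised local/global reconstruction is not specified in the abstract.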
