Abstract

Traditional fine-grained image recognition is required to distinguish different subordinate categories (e.g., bird species) based on the visual cues in raw images. Due to both small inter-class variations and large intra-class variations, capturing the subtle differences between these sub-categories is crucial but challenging for fine-grained recognition. Recently, aggregating the language modality has proven to be a successful technique for improving visual recognition. In this paper, we introduce an end-to-end trainable Progressive Mask Attention (PMA) model for fine-grained recognition that leverages both visual and language modalities. Our Bi-Modal PMA model can not only capture the most discriminative parts in the visual modality stage by stage in a mask-based fashion, but also explore out-of-visual-domain knowledge from the language modality in an interactional alignment paradigm. Specifically, at each stage, a self-attention module is proposed to attend to the key patches of images or text descriptions. In addition, a query-relational module is designed to capture the key words/phrases in texts and further bridge the connection between the two modalities. The learned bi-modal representations from multiple stages are then aggregated as the final features for recognition. Our Bi-Modal PMA model requires only raw images and raw text descriptions, without bounding box/part annotations in images or key word annotations in texts. Comprehensive experiments on fine-grained benchmark datasets demonstrate that the proposed method achieves superior performance over competing baselines, with either vision-and-language bi-modality or the single visual modality.
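
The following is a minimal, illustrative sketch of the stage-wise idea described above: per-stage self-attention over visual patches, a query-relational step that relates the attended visual feature to text tokens, progressive masking of already-attended patches, and aggregation of the per-stage bi-modal features. All class, function, and parameter names (e.g., PMAStage, BiModalPMA, num_stages) are assumptions made for illustration and do not reflect the authors' actual implementation.

```python
# Hedged sketch of the Bi-Modal Progressive Mask Attention idea; not the
# authors' code. Feature extractors are omitted: random tensors stand in
# for CNN patch features and text token embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PMAStage(nn.Module):
    """One stage: self-attention over visual patches, plus a query-relational
    step that uses the attended visual feature to pick key text tokens."""

    def __init__(self, dim):
        super().__init__()
        self.visual_query = nn.Linear(dim, dim)  # scores image patches
        self.text_query = nn.Linear(dim, dim)    # relates visual feature to text tokens

    def forward(self, patches, tokens, patch_mask):
        # patches: (B, P, D) image patch features; tokens: (B, T, D) text features
        # patch_mask: (B, P), 0 for patches already attended in earlier stages
        scores = self.visual_query(patches).mean(-1)              # (B, P)
        scores = scores.masked_fill(patch_mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)                          # attention over patches
        visual_feat = torch.einsum("bp,bpd->bd", attn, patches)

        # Query-relational step: score text tokens against the attended visual feature.
        rel = torch.einsum("bd,btd->bt", self.text_query(visual_feat), tokens)
        text_attn = F.softmax(rel, dim=-1)
        text_feat = torch.einsum("bt,btd->bd", text_attn, tokens)

        # Progressive masking: suppress the most-attended patch for later stages.
        new_mask = patch_mask.clone()
        new_mask.scatter_(1, attn.argmax(dim=-1, keepdim=True), 0)
        return visual_feat, text_feat, new_mask


class BiModalPMA(nn.Module):
    def __init__(self, dim, num_stages, num_classes):
        super().__init__()
        self.stages = nn.ModuleList(PMAStage(dim) for _ in range(num_stages))
        self.classifier = nn.Linear(2 * dim * num_stages, num_classes)

    def forward(self, patches, tokens):
        mask = torch.ones(patches.shape[:2], device=patches.device)
        feats = []
        for stage in self.stages:
            v, t, mask = stage(patches, tokens, mask)
            feats += [v, t]
        # Aggregate the bi-modal features from all stages for recognition.
        return self.classifier(torch.cat(feats, dim=-1))


# Toy usage: 2 images with 49 patch features and 32 text token features each.
model = BiModalPMA(dim=256, num_stages=3, num_classes=200)
logits = model(torch.randn(2, 49, 256), torch.randn(2, 32, 256))
print(logits.shape)  # torch.Size([2, 200])
```

The sketch keeps only the structural pattern (mask-driven stage-wise attention and bi-modal aggregation); the paper's actual attention formulations, masking strategy, and alignment losses should be taken from the full text.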
