Abstract

Zero-shot learning (ZSL) models typically learn a cross-modal mapping between the visual feature space and the semantic embedding space. Despite promising performance achieved by existing methods, they usually take visual features from the whole image as the main proposed inputs, while pay little attention to image regions which are relevant to human’s visual response to the whole image. In this article, we propose a neural network-based ZSL model which incorporates an attention mechanism to discover the discriminative parts for each image. The proposed model allows us to automatically generate attention maps for visual parts, which provides a flexible way of encoding the salient visual aspects to distinguish the categories. Moreover, we introduce a simple yet effective objective function to exploit the pairwise label information between images and classes, resulting in substantial performance improvement. When multiple semantic spaces are available, a multiple-attention scheme is provided to fuse different semantic spaces, which helps to achieve further improvement in performance. On the widely used CUB-2010-2011 data set for fine-grained image classification, we demonstrate the advantages of using attention mechanism and semantic parts in our model for ZSL. Comprehensive experimental results show that our proposed approach achieves superior performance than the state-of-the-art methods.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.