Abstract

Zero-shot learning aims to recognize image categories that are “unseen” during the training phase of image classification models. The key to this task is to transfer the knowledge learned from “seen” classes to “unseen” classes. To make this knowledge transfer more effective, we propose to exploit visual and semantic attention mechanisms simultaneously in zero-shot learning tasks. Specifically, a dual-focus transfer network (DFTN) model is proposed that implements attention mechanisms at both the visual and semantic ends of a mapping-based zero-shot learning framework, through a visual focus transfer (VFT) module and a semantic focus transfer (SFT) module. The VFT module is composed of multi-head self-attention networks, which assign greater weights to salient parts of images at different resolutions of the feature maps. The SFT module generates semantic weights that re-weight the semantic attribute features under the guidance of visual representations, so that semantic attributes with greater visual discriminative power obtain larger weights. Extensive experiments on zero-shot learning and generalized zero-shot learning over five representative benchmarks demonstrate the superiority of the proposed DFTN model compared to other state-of-the-art methods.
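To make the described architecture concrete, below is a minimal PyTorch sketch of the two modules as the abstract characterizes them: multi-head self-attention over a visual feature map (VFT), and a visually guided gating vector that re-weights per-class semantic attributes (SFT), with classification by compatibility in attribute space as in mapping-based zero-shot learning. All module names, dimensions, and the single-resolution simplification here are illustrative assumptions, not the authors' implementation; the paper applies VFT at multiple feature-map resolutions.

```python
import torch
import torch.nn as nn

class VFTModule(nn.Module):
    """Visual focus sketch: multi-head self-attention over one feature map,
    so salient spatial locations receive greater weights (single resolution)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat):                       # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C) spatial tokens
        attended, _ = self.attn(tokens, tokens, tokens)
        return attended.mean(dim=1)                # (B, C) pooled visual representation

class SFTModule(nn.Module):
    """Semantic focus sketch: the visual representation produces a gating
    vector that re-weights the semantic attribute features."""
    def __init__(self, vis_dim, attr_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(vis_dim, attr_dim), nn.Sigmoid())

    def forward(self, vis, attrs):                 # vis: (B, vis_dim); attrs: (K, attr_dim)
        weights = self.gate(vis)                   # (B, attr_dim), one weight per attribute dim
        return weights.unsqueeze(1) * attrs.unsqueeze(0)   # (B, K, attr_dim)

# Compatibility scoring between a projected visual embedding and each class's
# re-weighted attribute vector (illustrative dimensions).
vis_dim, attr_dim, num_classes = 512, 85, 50
vft, sft = VFTModule(vis_dim), SFTModule(vis_dim, attr_dim)
proj = nn.Linear(vis_dim, attr_dim)                # maps visual features into attribute space

feat_map = torch.randn(2, vis_dim, 7, 7)           # stand-in for a CNN backbone feature map
class_attrs = torch.randn(num_classes, attr_dim)   # stand-in per-class attribute vectors

vis = vft(feat_map)                                # (2, vis_dim)
attrs = sft(vis, class_attrs)                      # (2, num_classes, attr_dim)
scores = (proj(vis).unsqueeze(1) * attrs).sum(-1)  # (2, num_classes) compatibility scores
print(scores.shape)
```

In this reading, the prediction for an unseen image is the class whose re-weighted attribute vector scores the highest compatibility, which is how attention on both ends can sharpen the seen-to-unseen knowledge transfer.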
