Abstract

Zero-Shot Learning (ZSL) aims to recognise object classes that are not observed during the training phase. Most existing ZSL methods focus on learning a compatibility function between the image representation and the class semantic information. A few others concentrate on learning image representations by combining local and global features. However, existing approaches still fail to address the bias towards seen classes. This paper proposes implicit and explicit attention mechanisms to address this bias problem in generalised ZSL models. We formulate the implicit attention mechanism through a self-supervised image rotation-prediction task, which directs the model towards the specific image features needed to solve that task. The explicit attention mechanism is realised through the multi-headed self-attention of a Vision Transformer (ViT), which learns to attend to important image locations and maps global image features to the semantic space during training. We conduct comprehensive experiments on three popular benchmarks, AWA2, CUB and SUN, and demonstrate the effectiveness of the proposed attention mechanisms in both discriminative and generative settings. Our method achieves state-of-the-art performance, obtaining the highest harmonic mean on all three datasets, which makes ViT-based attention mechanisms an encouraging direction for future ZSL work.
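
To make the two branches concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: `vit_backbone`, `feat_dim`, `attr_dim`, the loss forms, and their equal weighting are hypothetical placeholders. The explicit branch maps the ViT's global feature to the class-attribute (semantic) space, while the implicit branch predicts which of four rotations (0°, 90°, 180°, 270°) was applied to the input.

```python
import torch
import torch.nn as nn


class AttentionZSL(nn.Module):
    """Hypothetical sketch: a ViT backbone with an explicit semantic-mapping head
    and an implicit self-supervised rotation-prediction head."""

    def __init__(self, vit_backbone, feat_dim=768, attr_dim=312, num_rotations=4):
        super().__init__()
        self.backbone = vit_backbone                            # any ViT returning a global (e.g. [CLS]) feature
        self.semantic_head = nn.Linear(feat_dim, attr_dim)      # explicit: map image feature to class semantics
        self.rotation_head = nn.Linear(feat_dim, num_rotations) # implicit: predict the applied rotation

    def forward(self, images):
        feats = self.backbone(images)                           # global features via multi-head self-attention
        return self.semantic_head(feats), self.rotation_head(feats)


def zsl_losses(sem_pred, rot_pred, class_attrs, labels, rot_labels):
    # Explicit branch: regress the attribute vector of the ground-truth seen class.
    sem_loss = nn.functional.mse_loss(sem_pred, class_attrs[labels])
    # Implicit branch: classify which of the four rotations was applied.
    rot_loss = nn.functional.cross_entropy(rot_pred, rot_labels)
    return sem_loss + rot_loss
```

At inference, an unseen image would be scored against each unseen class by comparing the predicted semantic vector with the class attribute vectors (e.g. via cosine similarity), following the usual compatibility-based ZSL protocol.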
