Abstract

Zero-shot image recognition aims to classify data from unseen classes by exploiting the association between visual features and the semantic representations of each class. Most existing approaches learn a shared single-scale embedding space (often at the output layer of the network) for both visual and semantic features, ignoring the fact that visual features at different scales carry different semantics. In this article, we propose a multi-scale visual-attribute co-attention (mVACA) model that considers both visual-semantic alignment and visual discrimination at multiple scales. At each scale, a hybrid visual attention is realized by attribute-related attention and visual self-attention. The attribute-related attention is guided by a pseudo attribute vector inferred via a mutual information regularization (MIR), and the visual self-attentive features in turn modulate the attribute attention to emphasize visually associated attributes. Leveraging multi-scale visual discrimination, mVACA unifies the standard zero-shot learning (ZSL) and generalized ZSL tasks in one framework, achieving state-of-the-art or competitive performance on several commonly used benchmarks for both setups. To better understand the interaction between images and attributes in mVACA, we also provide a visualized analysis.
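
To make the hybrid attention concrete, below is a minimal sketch of one scale of the co-attention, assuming PyTorch. The pseudo attribute vector is taken as given (the abstract states it is inferred via MIR); it is projected into the visual space and used to attend over self-attended region features. All module names, dimensions, and the fusion scheme are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleCoAttention(nn.Module):
    """Illustrative sketch of one scale of the hybrid attention:
    visual self-attention over image regions, followed by
    attribute-related attention driven by a pseudo attribute vector.
    Layer choices and sizes are assumptions, not the paper's design."""

    def __init__(self, visual_dim: int, attr_dim: int):
        super().__init__()
        # Visual self-attention over spatial regions (assumed single-head).
        self.self_attn = nn.MultiheadAttention(visual_dim, num_heads=1,
                                               batch_first=True)
        # Projects the pseudo attribute vector into the visual space
        # so it can act as a query over region features.
        self.attr_proj = nn.Linear(attr_dim, visual_dim)

    def forward(self, regions: torch.Tensor, pseudo_attr: torch.Tensor):
        # regions: (B, R, visual_dim) region features at this scale
        # pseudo_attr: (B, attr_dim) pseudo attribute vector, assumed
        # to be inferred elsewhere under the MIR objective
        v, _ = self.self_attn(regions, regions, regions)        # self-attended regions
        q = self.attr_proj(pseudo_attr).unsqueeze(1)            # (B, 1, visual_dim)
        scores = torch.matmul(q, v.transpose(1, 2))             # (B, 1, R)
        weights = F.softmax(scores / v.size(-1) ** 0.5, dim=-1) # attribute-related attention
        attended = torch.matmul(weights, v).squeeze(1)          # (B, visual_dim)
        return attended, weights.squeeze(1)

# Usage sketch: 49 regions of 512-d features, an 85-d attribute space.
block = ScaleCoAttention(visual_dim=512, attr_dim=85)
feats, attn = block(torch.randn(4, 49, 512), torch.randn(4, 85))
```

In the full multi-scale model, one such block per scale would produce scale-specific attended features, which could then be aligned with class semantics for ZSL and generalized ZSL classification.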
