Abstract

Accurate attribute representations are critically important in Zero-shot Learning (ZSL), as most ZSL methods rely on a shared visual-semantic embedding to transfer knowledge from seen to unseen classes. However, many existing works recognize semantic attributes directly with a common image classification framework, which can fail because it ignores how attribute representations differ across images. In this paper, we argue that attribute annotations contain complementary information that should be handled separately for better recognition. To this end, our method consists of two branches: the Attribute Refinement by Localization (ARL) branch and the Visual-Semantic Interaction (VSI) branch. The ARL branch refines the representations of tangible attribute information through channel selection and spatial suppression, localizing the visual information relevant to an attribute more accurately. To model abstract attribute information effectively, the VSI branch performs visual-semantic interaction by integrating attribute prototypes into the visual features. By combining the two branches, we accurately model the complementary attribute information for ZSL. Extensive experiments on three benchmark datasets validate the effectiveness of the proposed method, which achieves considerable performance improvements over state-of-the-art methods.
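The sketch below is not the authors' implementation; it is a minimal, hypothetical reading of the two branches described above, assuming a CNN backbone that outputs a feature map of shape (B, C, H, W) and a matrix of A attribute prototypes of dimension D. Module names, the sigmoid channel gate, and the softmax spatial attention are illustrative design choices, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ARLBranch(nn.Module):
    """Attribute Refinement by Localization (sketch): channel selection + spatial suppression."""

    def __init__(self, channels: int, num_attributes: int):
        super().__init__()
        # Channel selection: a gate over globally pooled features (assumed design).
        self.channel_gate = nn.Linear(channels, channels)
        # Spatial suppression: one attention map per attribute via a 1x1 conv (assumed design).
        self.spatial_attn = nn.Conv2d(channels, num_attributes, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:  # feat: (B, C, H, W)
        gate = torch.sigmoid(self.channel_gate(feat.mean(dim=(2, 3))))  # (B, C)
        feat = feat * gate[:, :, None, None]                            # select channels
        attn = self.spatial_attn(feat).flatten(2)                       # (B, A, H*W)
        attn = F.softmax(attn, dim=-1)                                  # suppress irrelevant locations
        # Attribute-localized features: attention-weighted average over spatial positions.
        return torch.einsum('bap,bcp->bac', attn, feat.flatten(2))      # (B, A, C)


class VSIBranch(nn.Module):
    """Visual-Semantic Interaction (sketch): integrate attribute prototypes into visual features."""

    def __init__(self, channels: int, attr_dim: int):
        super().__init__()
        self.proj = nn.Linear(attr_dim, channels)  # map prototypes into the visual space

    def forward(self, feat: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), prototypes: (A, D)
        proto = self.proj(prototypes)          # (A, C)
        gap = feat.mean(dim=(2, 3))            # (B, C) global visual descriptor
        # Compatibility scores between each image and each attribute prototype.
        return gap @ proto.t()                 # (B, A)


if __name__ == "__main__":
    feat = torch.randn(4, 512, 7, 7)       # backbone feature map
    protos = torch.randn(85, 300)          # e.g., 85 attributes with 300-d embeddings
    arl_out = ARLBranch(512, 85)(feat)     # (4, 85, 512) localized attribute features
    vsi_out = VSIBranch(512, 300)(feat, protos)  # (4, 85) attribute scores
    print(arl_out.shape, vsi_out.shape)
```

In this reading, the ARL output provides per-attribute visual features for the tangible attributes, while the VSI output gives prototype-conditioned attribute scores for the abstract ones; how the two are combined and trained follows the full paper, not this sketch.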
