Abstract

Zero-shot learning (ZSL) aims to recognize objects without seeing any visual instances by learning to transfer knowledge between seen and unseen classes. Attributes, which denote high-level visual entities or characteristics, have been widely used as an intermediate embedding space for knowledge transfer in most existing ZSL approaches and have shown impressive performance. However, providing attribute annotations for unseen classes at test time is time-consuming and labor-intensive. Moreover, directly using attributes as the intermediate embedding space for knowledge transfer and zero-shot prediction inevitably leads to the projection domain shift and hubness problems. In this paper, we propose a novel multi-view deep neural network, termed Fusion by Synthesis (FS), which leverages word embeddings of classes as a complement to attributes and performs zero-shot prediction by fusing the word embeddings of unseen classes with the synthesized attributes in the visual feature space. Specifically, in the training phase, treating the visual features, attributes, and word embeddings as three different views of visual instances, FS equips each view with a denoising auto-encoder that simultaneously ensures robust view-specific reconstruction and cross-view synthesis while preserving the discrimination of class labels. During testing, FS synthesizes the absent attributes for unseen classes and fuses them with the word embeddings in the visual feature space to perform zero-shot prediction. In addition, FS can learn with partial views, where either the attribute view or the word embedding view is missing during training, and can synthesize the missing view from the one provided. Extensive experiments on six benchmark datasets covering both image classification and action recognition show that FS effectively fuses multi-view data by synthesis and achieves superior performance compared with state-of-the-art ZSL methods.
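To make the pipeline concrete, the following is a minimal sketch of the fusion-by-synthesis idea as the abstract describes it: one denoising auto-encoder per view, cross-view synthesis heads, and unseen-class prototypes built in the visual feature space by fusing word embeddings with the attributes they synthesize. All layer sizes, loss weights, module names, and the averaging fusion rule are illustrative assumptions, not the paper's actual architecture; the class-discrimination loss mentioned in the abstract is also omitted here.

```python
# Hypothetical sketch of Fusion by Synthesis (FS); details are assumed.
import torch
import torch.nn as nn

class ViewDAE(nn.Module):
    """Denoising auto-encoder for one view (visual, attribute, or word)."""
    def __init__(self, dim, hidden=512, noise=0.1):
        super().__init__()
        self.noise = noise
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        z = self.enc(x + self.noise * torch.randn_like(x))  # corrupt, then encode
        return z, self.dec(z)                               # latent + reconstruction

class FusionBySynthesis(nn.Module):
    def __init__(self, d_vis=2048, d_attr=85, d_word=300, hidden=512):
        super().__init__()
        self.vis, self.attr, self.word = (ViewDAE(d, hidden)
                                          for d in (d_vis, d_attr, d_word))
        # Cross-view heads: synthesize attributes from word latents, and map
        # attribute / word latents into the visual feature space.
        self.word2attr = nn.Linear(hidden, d_attr)
        self.attr2vis = nn.Linear(hidden, d_vis)
        self.word2vis = nn.Linear(hidden, d_vis)

    def training_loss(self, x_vis, x_attr, x_word):
        (_, r_v), (za, r_a), (zw, r_w) = (self.vis(x_vis), self.attr(x_attr),
                                          self.word(x_word))
        mse = nn.functional.mse_loss
        recon = mse(r_v, x_vis) + mse(r_a, x_attr) + mse(r_w, x_word)
        synth = (mse(self.word2attr(zw), x_attr)   # word -> attribute
                 + mse(self.attr2vis(za), x_vis)   # attribute -> visual
                 + mse(self.word2vis(zw), x_vis))  # word -> visual
        return recon + synth  # a label-discrimination term would be added here

    @torch.no_grad()
    def class_prototype(self, word_emb):
        """Unseen-class prototype in visual space: fuse the word embedding
        with the attributes it synthesizes (here, a simple average)."""
        zw = self.word.enc(word_emb)
        za = self.attr.enc(self.word2attr(zw))  # encode synthesized attributes
        return 0.5 * (self.word2vis(zw) + self.attr2vis(za))
```

Under this sketch, a test instance's visual feature would be assigned to the unseen class whose fused prototype is nearest in the visual feature space, matching the abstract's account of zero-shot prediction; performing the comparison in visual space rather than attribute space is what mitigates the hubness problem.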
