Abstract

Zero-shot learning (ZSL) models use semantic representations of visual classes to transfer the knowledge learned from a set of training classes to a set of unseen test classes. In the context of generic object recognition, previous research has mainly focused on developing custom architectures, loss functions, and regularization schemes for ZSL, using word embeddings as the semantic representation of visual classes. In this paper, we focus exclusively on the effect of different semantic representations on the accuracy of ZSL. We first conduct a large-scale evaluation of semantic representations learned from words, text documents, or knowledge graphs on the standard ImageNet ZSL benchmark. We show that, with appropriate semantic representations of visual classes, a basic linear regression model outperforms the vast majority of previously proposed approaches. We then analyze the classification errors of our model to provide insights into the relevance and limitations of the different semantic representations we investigate. Finally, our investigation helps explain the success of recently proposed approaches based on graph convolutional networks (GCNs), which have shown dramatic improvements over previous state-of-the-art models.
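
To make this baseline concrete, the following is a minimal sketch of a linear-regression ZSL classifier of the kind described above. The ridge regularizer, the direction of the mapping (visual features to semantic space), and the cosine nearest-neighbor decision rule are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def fit_linear_map(X, S, lam=1.0):
    """Ridge-regularized least squares mapping visual features to semantic space.

    X: CNN features of seen-class images, shape (n, d)
    S: semantic embedding of each image's class, shape (n, k)
    """
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + lam*I) W = X^T S
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)  # (d, k)

def predict_zero_shot(W, X_test, S_unseen):
    """Project test features and pick the nearest unseen-class embedding."""
    P = X_test @ W                                              # (m, k)
    P = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-12)
    C = S_unseen / (np.linalg.norm(S_unseen, axis=1, keepdims=True) + 1e-12)
    return (P @ C.T).argmax(axis=1)  # cosine-similarity argmax over classes
```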

Highlights

  • Recent successes in generic object recognition have largely been driven by convolutional neural networks (CNNs) trained in a supervised manner on large image datasets

  • The main result of our study is to show that a basic linear regression model using graph embeddings outperforms previous state-of-the-art zero-shot learning (ZSL) models based on word embeddings (a toy sketch of graph embeddings follows these highlights)

  • Zero-shot learning has the potential to be of great practical impact and to facilitate the widespread use of object recognition technologies
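
As a rough illustration of what "graph embeddings" can mean here, the sketch below computes spectral embeddings of a toy class hierarchy so that classes adjacent in the graph land close together in embedding space. The toy graph, its size, and the spectral method are assumptions for illustration; embeddings like those evaluated in the paper would be learned from a full knowledge graph such as WordNet.

```python
import numpy as np

# Toy class hierarchy: parent-child links between 6 classes.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
n, k = 6, 3                                  # number of nodes, embedding dim

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                  # symmetric adjacency matrix

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized graph Laplacian

# Eigenvectors of the smallest non-trivial eigenvalues embed each class
# so that neighbors in the hierarchy end up close in embedding space.
eigvals, eigvecs = np.linalg.eigh(L)
embeddings = eigvecs[:, 1:k + 1]             # (n, k) class embeddings
```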

Summary

Introduction

Recent successes in generic object recognition have largely been driven by convolutional neural networks (CNNs) trained in a supervised manner on large image datasets. Zero-shot learning aims to extend this success to classes for which no labeled images are available, by relying on semantic representations of visual classes. Word embeddings are learned in an unsupervised manner from large text corpora, so they can be collected at scale without human supervision. Their successful application to a number of natural language processing (NLP) tasks has shown that word embeddings encode a number of desirable semantic features, which have naturally been assumed to generalize to vision tasks. For these reasons, word embeddings have become the standard semantic representation of visual classes in ZSL.
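
For instance, a common way to obtain such word-embedding representations of visual classes is to average pretrained word vectors over the tokens of a class name. The sketch below uses gensim's downloadable GloVe vectors; the corpus choice and the averaging scheme are illustrative assumptions, not necessarily the embeddings used in the paper.

```python
import gensim.downloader as api

# Load pretrained 300-d GloVe vectors (illustrative corpus choice).
wv = api.load("glove-wiki-gigaword-300")

def class_prototype(class_name):
    """Average the word vectors of a (possibly multi-word) class name."""
    tokens = [t for t in class_name.lower().split() if t in wv]
    if not tokens:
        raise KeyError(f"no embedding found for '{class_name}'")
    return sum(wv[t] for t in tokens) / len(tokens)

proto = class_prototype("killer whale")  # 300-d semantic representation
```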

