Abstract

Zero-shot learning addresses the classification of unseen categories, while generalized zero-shot learning aims to classify samples drawn from both seen and unseen classes; here, "seen" classes are those available during training, and "unseen" classes are not. With the advance of deep learning, the performance of zero-shot learning has improved greatly, and generalized zero-shot learning remains a challenging topic with promising prospects in many realistic scenarios. Although zero-shot learning has made gratifying progress, existing methods still exhibit a strong bias toward seen classes. Recent methods focus on learning a unified semantic-aligned visual representation to transfer knowledge between the two domains, while ignoring the intrinsic characteristics of visual features, which are discriminative enough to support classification on their own. To address these problems, we propose a novel model that uses the discriminative information of visual features to optimize the generative module, a dual generation network composed of a conditional VAE and an improved WGAN. Specifically, the model exploits the discriminative information of visual features: conditioned on the relevant semantic embedding, the learned generator synthesizes visual features of unseen categories, which are then used to train the final softmax classifier, thus enabling recognition of unseen categories. In addition, we analyze the effect of additional classifiers with different structures on the transfer of discriminative information. We conducted extensive experiments on six commonly used benchmark datasets (AWA1, AWA2, APY, FLO, SUN, and CUB).
The experimental results show that our model outperforms several state-of-the-art methods on both traditional and generalized zero-shot learning.
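The overall pipeline described above (synthesize visual features for unseen classes from their semantic embeddings, then train a softmax classifier on them) can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: the generator is mocked with class-dependent Gaussians, and all dimensions and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 unseen classes. A trained generator would map a
# semantic embedding s and noise z to a visual feature G(s, z); here we
# mock its output with well-separated class-dependent Gaussians so the
# sketch runs on its own.
n_classes, d_v, n_per_class = 3, 16, 50
centers = rng.normal(0, 3, (n_classes, d_v))    # stand-in for G(s_c, z)

# Synthesized visual features and their (unseen-class) labels.
X = np.vstack([centers[c] + rng.normal(0, 1, (n_per_class, d_v))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# Train the final softmax classifier on the synthesized features
# via plain gradient descent on the cross-entropy loss.
W = np.zeros((n_classes, d_v))
onehot = np.eye(n_classes)[y]
for _ in range(200):
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * (p - onehot).T @ X / len(X)

# The classifier can now label unseen-class samples it never saw real
# training data for; on these separable mock features accuracy is high.
acc = (np.argmax(X @ W.T, axis=1) == y).mean()
print(acc)
```

At test time, real visual features of unseen-class images would be fed to this classifier in place of the synthesized ones.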

Highlights

  • We propose a dual generative framework to synthesize visual feature representations of unseen classes stably and efficiently. The dual generative network combines the strengths of an improved WGAN and a conditional VAE, which together alleviate mode collapse and unstable training

  • The encoder, the generator, and the discriminator are all implemented as multilayer perceptrons (MLPs). Through experiments, we find that zero-shot learning performs best when the semantic embedding s and the Gaussian random noise z ∼ N(0, 1) have the same dimension
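The MLP generator from the highlights above can be sketched as a single-hidden-layer forward pass over the concatenated semantic embedding and noise vector. All weights, layer sizes, and the 85-dimensional attribute space are illustrative assumptions (85 matches the AWA attribute dimension, but the paper's actual architecture may differ); note the noise dimension is set equal to the semantic dimension, per the highlight.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(s, z, W1, b1, W2, b2):
    """Illustrative MLP generator: G([s; z]) -> synthesized visual feature.
    One hidden layer with LeakyReLU; weights are hypothetical."""
    h = np.concatenate([s, z])            # condition on semantics + noise
    h = W1 @ h + b1
    h = np.where(h > 0, h, 0.2 * h)       # LeakyReLU activation
    return W2 @ h + b2

d_s = 85            # semantic (attribute) dim, e.g. AWA-style attributes
d_z = d_s           # noise dim matched to the semantic dim, per the highlight
d_h, d_v = 128, 2048  # hidden dim and visual feature dim (hypothetical)

W1 = rng.normal(0, 0.02, (d_h, d_s + d_z)); b1 = np.zeros(d_h)
W2 = rng.normal(0, 0.02, (d_v, d_h));       b2 = np.zeros(d_v)

s = rng.random(d_s)           # semantic embedding of one class
z = rng.normal(size=d_z)      # z ~ N(0, 1)
x_fake = generator(s, z, W1, b1, W2, b2)
print(x_fake.shape)           # one synthesized visual feature vector
```

In the full model, the same MLP shape would also be used for the encoder and the discriminator, with the discriminator mapping a (feature, semantics) pair to a scalar critic score.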


Introduction

Deep learning [1,2,3,4] has achieved great success in a wide range of computer vision and machine learning tasks [5], including face recognition, emotion classification, and visual question answering. In many cases, these deep learning models even surpass human performance, because they can capture subtle information in images that human eyes may overlook. However, such models typically require abundant labeled training data for every class they must recognize; to classify categories for which no training samples are available, the concept of zero-shot learning has been put forward, which has attracted wide attention in the field of computer vision and has developed rapidly

