Abstract

Many approaches to generalized zero-shot learning (GZSL) rely on visual features extracted by a CNN pre-trained on ImageNet together with hand-designed semantic features. However, due to cross-dataset bias, the quality of the extracted visual features is hard to guarantee. In addition, because of human cognitive limitations, hand-collected attributes cannot cover the whole visual space, so some discriminative features are inevitably overlooked when the semantic space is designed. In this work, we visualize the visual and semantic features of the GZSL benchmark datasets and find that both commonly suffer from an entanglement problem. To address this problem, we present a center-VAE that minimizes intra-class variation while preserving inter-class relations, guiding the VAE to learn a more discriminative latent space. In addition, we deploy a regressor that reconstructs the corresponding semantic features from the latent representations, guiding the VAE to learn a latent space that preserves semantically relevant information. After training the VAE, we concatenate the visual features with the corresponding latent representations to obtain fine-tuned features, on which we train a softmax classifier. We evaluate the proposed method on several benchmark datasets and establish a new state of the art. Moreover, we visualize the fine-tuned features, and the results demonstrate that our method significantly alleviates the entanglement problem.
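The two ingredients the abstract names, a center loss on the latent space and the concatenation of visual features with latent representations, can be sketched in a few lines. This is a minimal illustration under assumed toy dimensions, not the paper's implementation; the variable names and sizes are hypothetical.

```python
import numpy as np

# Hypothetical toy dimensions; the paper's actual feature sizes are not given here.
rng = np.random.default_rng(0)
n, d_vis, d_lat, n_cls = 6, 8, 4, 3
visual = rng.normal(size=(n, d_vis))   # CNN visual features (stand-in)
latent = rng.normal(size=(n, d_lat))   # VAE latent representations (stand-in)
labels = np.array([0, 0, 1, 1, 2, 2])  # class labels

# Center loss: mean squared distance of each latent vector to its class
# center, penalizing intra-class variation in the latent space.
centers = np.stack([latent[labels == c].mean(axis=0) for c in range(n_cls)])
center_loss = np.mean(np.sum((latent - centers[labels]) ** 2, axis=1))

# Fine-tuned features: concatenate visual features with the corresponding
# latents; a softmax classifier would then be trained on these.
fine_tuned = np.concatenate([visual, latent], axis=1)
print(fine_tuned.shape)  # (6, 12)
```

In the full method this center penalty would be added to the usual VAE reconstruction and KL terms, and a regressor loss would tie the latents back to the semantic attributes.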
