Abstract

In generative adversarial network (GAN) based zero-shot learning (ZSL) approaches, the synthesized unseen visual features are inevitably biased toward seen classes, since the feature generator is trained only on seen references; this causes inconsistency between visual features and their corresponding semantic attributes. The visual-semantic inconsistency is primarily induced by semantic-relevant components that are not preserved and semantic-irrelevant low-level visual details that are not rectified. Existing generative models generally tackle the issue by aligning the distributions of the two modalities with an additional visual-to-semantic embedding, which tends to cause the hubness problem and ruin the diversity of the visual modality. In this paper, we propose a novel generative model, named learning modality-consistent latent representations GAN (LCR-GAN), which addresses the problem by embedding the visual features and their semantic attributes into a shared latent space. Specifically, to preserve the semantic-relevant components, the distributions of the two modalities are aligned by maximizing the mutual information between them. To rectify the semantic-irrelevant visual details, the mutual information between the original visual features and their latent representations is confined within an appropriate range. Meanwhile, the latent representations are decoded back into both modalities to further preserve the semantic-relevant components. Extensive evaluations on four public ZSL benchmarks validate the superiority of our method over other state-of-the-art methods.
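
The sketch below illustrates, under stated assumptions, how the objectives described in the abstract could be combined: a contrastive (InfoNCE) lower bound to maximize mutual information between the visual and semantic latent codes, a variational KL-to-prior bound to confine the mutual information between visual features and their latents, and decoders mapping the latent code back to both modalities. It is a minimal conceptual sketch in PyTorch, not the authors' implementation; the network sizes, the specific MI bounds, and the loss weights are illustrative assumptions.

```python
# Conceptual sketch of a modality-consistent latent objective (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentEncoder(nn.Module):
    """Maps an input (visual feature or semantic attribute) to a latent Gaussian."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

def info_nce(z_v, z_s, temperature=0.1):
    """Contrastive lower bound on I(z_v; z_s): matched visual/semantic latents
    in a batch are positives, all other pairings are negatives."""
    z_v = F.normalize(z_v, dim=-1)
    z_s = F.normalize(z_s, dim=-1)
    logits = z_v @ z_s.t() / temperature
    labels = torch.arange(z_v.size(0), device=z_v.device)
    return F.cross_entropy(logits, labels)

def kl_to_standard_normal(mu, logvar):
    """Variational upper bound on I(x; z); penalizing it limits how much
    semantic-irrelevant low-level detail the latent code can carry."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()

# Hypothetical dimensions: 2048-d visual features, 312-d attribute vectors.
vis_dim, att_dim, batch = 2048, 312, 32
enc_v, enc_s = LatentEncoder(vis_dim), LatentEncoder(att_dim)
dec_v = nn.Linear(64, vis_dim)   # decode latent back to visual space
dec_s = nn.Linear(64, att_dim)   # decode latent back to semantic space

x_v = torch.randn(batch, vis_dim)   # synthesized or real visual features (placeholder data)
x_s = torch.randn(batch, att_dim)   # class semantic attributes (placeholder data)

z_v, mu_v, logvar_v = enc_v(x_v)
z_s, _, _ = enc_s(x_s)

loss_align = info_nce(z_v, z_s)                     # maximize cross-modal MI
loss_bound = kl_to_standard_normal(mu_v, logvar_v)  # confine I(x_v; z_v)
loss_rec = F.mse_loss(dec_v(z_v), x_v) + F.mse_loss(dec_s(z_v), x_s)  # decode to both modalities

total = loss_align + 0.1 * loss_bound + loss_rec    # weights are illustrative
total.backward()
```

In this sketch the InfoNCE term pulls matched visual/semantic latent pairs together while pushing apart mismatched pairs, which is one common way to realize mutual-information maximization without an explicit visual-to-semantic regressor; the actual bounds and weighting used by LCR-GAN may differ.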
