Abstract
Cross-modal retrieval aims to enable accurate and flexible retrieval across different data modalities, e.g., image and text. The field has made significant progress in recent years, especially since the adoption of generative adversarial networks (GANs). However, there is still much room for improvement: how to jointly extract and effectively use both modality-specific (complementarity) and modality-shared (correlation) features has not been well studied. In this paper, we propose an approach named Modality-Specific and Shared Generative Adversarial Network (MS2GAN) for cross-modal retrieval. The network architecture consists of two sub-networks that learn modality-specific features for each modality, followed by a common sub-network that learns modality-shared features for each modality. Network training is guided by an adversarial scheme between the generative and discriminative models. The generative model learns to predict the semantic labels of features, to model inter- and intra-modal similarity using label information, and to enforce a difference between the modality-specific and modality-shared features, while the discriminative model learns to classify the modality of the features. The learned modality-specific and modality-shared feature representations are used jointly for retrieval. Experiments on three widely used multi-modal benchmark datasets demonstrate that MS2GAN can outperform state-of-the-art related methods.
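The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it covers only the forward pass and the retrieval step, not the adversarial training or the label, similarity, and difference objectives. Everything specific in it is an assumption made for illustration: the feature dimensions, the single-layer sub-networks with random weights, feeding the modality-specific output into the common sub-network (one possible reading of "followed by"), combining the two representations by concatenation, and ranking by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the abstract does not specify any of them.
IMG_DIM, TXT_DIM, FEAT_DIM = 4096, 300, 64


def dense(in_dim, out_dim):
    """Random weights for one fully connected layer (stand-in for trained parameters)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def forward(x, layer):
    """Apply one fully connected layer with a tanh non-linearity."""
    weights, bias = layer
    return np.tanh(x @ weights + bias)


# One modality-specific sub-network per modality (assumed single-layer).
img_net = dense(IMG_DIM, FEAT_DIM)
txt_net = dense(TXT_DIM, FEAT_DIM)
# One common sub-network whose weights are reused for both modalities.
common_net = dense(FEAT_DIM, FEAT_DIM)


def embed(x, specific_net):
    """Return the L2-normalised concatenation of specific and shared features."""
    specific = forward(x, specific_net)      # modality-specific features
    shared = forward(specific, common_net)   # modality-shared features (assumed wiring)
    joint = np.concatenate([specific, shared], axis=1)
    return joint / np.linalg.norm(joint, axis=1, keepdims=True)


def retrieve(queries, gallery):
    """Rank gallery items for each query by cosine similarity, best match first."""
    similarities = queries @ gallery.T
    return np.argsort(-similarities, axis=1)


# Toy data standing in for pre-extracted image and text features.
images = rng.normal(size=(5, IMG_DIM))
texts = rng.normal(size=(3, TXT_DIM))

image_emb = embed(images, img_net)
text_emb = embed(texts, txt_net)
ranking = retrieve(text_emb, image_emb)  # text-to-image retrieval
print(ranking.shape)
```

Because both modalities end up in a space of the same dimensionality, the same `retrieve` function can be called with the arguments swapped for image-to-text retrieval.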