Abstract

This paper presents a novel approach to cross-modal retrieval, Adversarial Learning with Wasserstein Distance (ALWD), which learns aligned representations for different modalities in a GAN framework. The generator projects image and text features into a shared representation space, while the discriminator ensures that the projected image and text features remain close to each other in a way that preserves the semantic relations between input samples. In effect, ALWD reformulates cross-modal retrieval as an image-text domain adaptation problem whose goal is to reduce the discrepancy between the two domains. To learn domain-invariant representations, a domain critic network estimates the Wasserstein distance between the modal distributions, and the feature extractor network is optimized in an adversarial manner to minimize that distance. In addition, ALWD adopts an additive-margin softmax loss so that the learned representations are also discriminative for label prediction, and it imposes a structure-preservation constraint to keep local neighborhood structure consistent during learning. Extensive experiments on three widely used benchmark datasets demonstrate that ALWD outperforms state-of-the-art cross-modal retrieval methods.
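
The adversarial objective described above can be made concrete with a short sketch. The domain critic estimates the Wasserstein-1 distance through its Kantorovich-Rubinstein dual, $W_1(P_I, P_T) = \sup_{\|f\|_L \le 1} \mathbb{E}_{h \sim P_I}[f(h)] - \mathbb{E}_{h \sim P_T}[f(h)]$, where $P_I$ and $P_T$ are the image- and text-feature distributions. Below is a minimal PyTorch illustration under that reading; the module names, the use of a gradient penalty to enforce the 1-Lipschitz constraint, and all hyperparameter values (s, m, lam, alpha) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an ALWD-style objective (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainCritic(nn.Module):
    """Scores a feature vector; the difference of mean scores over image and
    text features estimates the Wasserstein-1 distance (K-R dual form)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def gradient_penalty(critic, h_img, h_txt):
    """Penalize deviation of the critic's gradient norm from 1 on random
    interpolates, softly enforcing the 1-Lipschitz constraint (WGAN-GP style)."""
    eps = torch.rand(h_img.size(0), 1, device=h_img.device)
    h_hat = (eps * h_img + (1.0 - eps) * h_txt).requires_grad_(True)
    grad = torch.autograd.grad(critic(h_hat).sum(), h_hat, create_graph=True)[0]
    return ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(critic, h_img, h_txt, lam=10.0):
    """Critic maximizes the Wasserstein estimate, so its loss is the negative.
    Pass encoder outputs detached when updating the critic."""
    w_est = critic(h_img).mean() - critic(h_txt).mean()
    return -w_est + lam * gradient_penalty(critic, h_img, h_txt)

def am_softmax_loss(h, class_weights, labels, s=30.0, m=0.35):
    """Additive-margin softmax: subtract a margin m from the target-class
    cosine similarity, scale by s, then apply cross-entropy."""
    cos = F.normalize(h) @ F.normalize(class_weights, dim=0)  # (B, C) cosines
    margin = m * F.one_hot(labels, cos.size(1)).float()
    return F.cross_entropy(s * (cos - margin), labels)

def encoder_loss(critic, h_img, h_txt, class_weights, labels, alpha=1.0):
    """Feature extractors minimize the estimated Wasserstein distance
    (adversarial term) plus the discriminative AM-softmax term."""
    w_est = critic(h_img).mean() - critic(h_txt).mean()
    cls = am_softmax_loss(h_img, class_weights, labels) \
        + am_softmax_loss(h_txt, class_weights, labels)
    return w_est + alpha * cls
```

In an alternating loop, the critic would be trained on critic_loss for a few steps with the encoder features detached, and the image and text encoders on encoder_loss. The structure-preservation constraint mentioned in the abstract is omitted here because its exact form is not specified in this excerpt.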
