Abstract

Cross-modal retrieval aims to retrieve semantically similar items of one media type given a query from another media type. An intuitive idea is to map data from different media into a common space and then directly measure content similarity between the different types of data. In this paper, we present a novel method, called Category Alignment Adversarial Learning (CAAL), for cross-modal retrieval. It aims to find a common representation space, supervised by category information, in which samples from different modalities can be compared directly. Specifically, CAAL first employs two parallel encoders to generate common representations for image and text features, respectively. Furthermore, we employ two parallel GANs conditioned on category information to generate fake image and text features, which are then combined with the previously generated embeddings to reconstruct the common representation. Finally, two joint discriminators are utilized to reduce the gap between the mapping of the first stage and the embedding of the second stage. Comprehensive experimental results on four widely used benchmark datasets demonstrate the superior performance of our proposed method compared with state-of-the-art approaches.
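
The abstract only sketches the architecture, so the following is a minimal, hypothetical PyTorch sketch of the components it names: two modality-specific encoders into a common space, two category-conditioned generators producing fake image/text features, and joint discriminators that compare first-stage and second-stage representations. All layer sizes, module names, and the noise dimension are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the abstract does not specify them.
IMG_DIM, TXT_DIM, COMMON_DIM, NUM_CLASSES, NOISE_DIM = 4096, 300, 512, 10, 100

class Encoder(nn.Module):
    """Maps a modality-specific feature vector into the common representation space."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, COMMON_DIM),
        )

    def forward(self, x):
        return self.net(x)

class ConditionalGenerator(nn.Module):
    """Generates a fake modality feature from noise concatenated with a one-hot category label."""
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + NUM_CLASSES, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

class JointDiscriminator(nn.Module):
    """Scores a pair of common representations (first-stage mapping vs. second-stage embedding)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * COMMON_DIM, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, c_first, c_second):
        return self.net(torch.cat([c_first, c_second], dim=1))

# One pipeline per modality, mirroring the two parallel branches the abstract describes.
img_encoder, txt_encoder = Encoder(IMG_DIM), Encoder(TXT_DIM)
img_generator, txt_generator = ConditionalGenerator(IMG_DIM), ConditionalGenerator(TXT_DIM)
img_discriminator, txt_discriminator = JointDiscriminator(), JointDiscriminator()
```

How the reconstructed second-stage embedding is formed from the fake features and the first-stage embedding, and which adversarial losses are used, are not specified in the abstract and would follow the full paper.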
