Abstract

Eliminating the semantic gap between multi-modal data and fusing them effectively are the key problems in cross-modal retrieval. Because semantics are abstract, the representation of any single sample captures them only partially. To obtain complementary semantic information from samples that share the same semantics, we construct a local graph for each instance and use a graph feature extractor (GFE) to reconstruct the sample representation from the adjacency relations between the sample and its neighbors. Because many cross-modal methods focus only on learning from paired samples and cannot integrate additional information from the other modality, we propose a cross-modal graph attention strategy that generates a graph attention representation for each sample from the local graph of its paired sample. To eliminate the heterogeneity gap between modalities, we fuse the features of the two modalities with a recurrent gated memory network that selects prominent features from the other modality and filters out unimportant information, yielding a more discriminative feature representation in the common latent space. Experiments on four benchmark datasets demonstrate the superiority of our proposed model over state-of-the-art cross-modal retrieval methods.
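
To make the two core ideas of the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): a cross-modal attention module that lets a sample attend over its paired sample's local-graph neighbors, followed by a simple gated fusion that decides how much cross-modal information to keep. All module names, dimensions, and the neighbor count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGraphAttention(nn.Module):
    """Sketch: attend over the paired sample's local-graph neighbors
    to build a cross-modal graph attention representation."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, anchor, neighbors):
        # anchor:    (B, dim)    feature of the sample in one modality
        # neighbors: (B, K, dim) local-graph nodes of its paired sample
        q = self.query(anchor).unsqueeze(1)        # (B, 1, dim)
        k = self.key(neighbors)                    # (B, K, dim)
        v = self.value(neighbors)                  # (B, K, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5   # (B, K)
        attn = F.softmax(scores, dim=-1)
        return (attn.unsqueeze(-1) * v).sum(1)     # (B, dim)


class GatedFusion(nn.Module):
    """Sketch: a learned gate chooses how much of the cross-modal
    representation to mix into the original feature."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, own, cross):
        g = torch.sigmoid(self.gate(torch.cat([own, cross], dim=-1)))
        return g * own + (1 - g) * cross


# Usage on random tensors (assumed dim=256, K=8 neighbors per local graph)
B, K, dim = 4, 8, 256
img_feat = torch.randn(B, dim)            # image-side sample features
txt_graph = torch.randn(B, K, dim)        # paired text sample's local graph
attend = CrossModalGraphAttention(dim)
fuse = GatedFusion(dim)
common_repr = fuse(img_feat, attend(img_feat, txt_graph))  # (B, dim)
```

The gate here is a single sigmoid layer; the paper's recurrent gated memory network would apply such filtering iteratively, but the sketch illustrates the same select-and-filter principle.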
