Abstract

Eliminating the semantic gap between multi-modal data and fusing them effectively are the key problems in cross-modal retrieval. Because semantics are abstract, the representation of any single sample captures them only partially. To obtain complementary semantic information from samples that share the same semantics, we construct a local graph for each instance and use a graph feature extractor (GFE) to reconstruct the sample representation from the adjacency relations between the sample and its neighbors. Because many cross-modal methods focus only on learning from paired samples and cannot integrate additional information from the other modality, we propose a cross-modal graph attention strategy that generates a graph attention representation for each sample from the local graph of its paired sample. To eliminate the heterogeneity gap between modalities, we fuse the features of the two modalities with a recurrent gated memory network that selects prominent features from the other modality and filters out unimportant information, yielding a more discriminative feature representation in the common latent space. Experiments on four benchmark datasets demonstrate the superiority of our proposed model over state-of-the-art cross-modal retrieval methods.
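
To make the two core ideas of the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): a cross-modal attention module that lets a sample attend over its paired sample's local-graph neighbors, followed by a simple gated fusion that decides how much cross-modal information to keep. All module names, dimensions, and the neighbor count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGraphAttention(nn.Module):
    """Sketch: attend over the paired sample's local-graph neighbors
    to build a cross-modal graph attention representation."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, anchor, neighbors):
        # anchor:    (B, dim)    feature of the sample in one modality
        # neighbors: (B, K, dim) local-graph nodes of its paired sample
        q = self.query(anchor).unsqueeze(1)        # (B, 1, dim)
        k = self.key(neighbors)                    # (B, K, dim)
        v = self.value(neighbors)                  # (B, K, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5   # (B, K)
        attn = F.softmax(scores, dim=-1)
        return (attn.unsqueeze(-1) * v).sum(1)     # (B, dim)


class GatedFusion(nn.Module):
    """Sketch: a learned gate chooses how much of the cross-modal
    representation to mix into the original feature."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, own, cross):
        g = torch.sigmoid(self.gate(torch.cat([own, cross], dim=-1)))
        return g * own + (1 - g) * cross


# Usage on random tensors (assumed dim=256, K=8 neighbors per local graph)
B, K, dim = 4, 8, 256
img_feat = torch.randn(B, dim)            # image-side sample features
txt_graph = torch.randn(B, K, dim)        # paired text sample's local graph
attend = CrossModalGraphAttention(dim)
fuse = GatedFusion(dim)
common_repr = fuse(img_feat, attend(img_feat, txt_graph))  # (B, dim)
```

The gate here is a single sigmoid layer; the paper's recurrent gated memory network would apply such filtering iteratively, but the sketch illustrates the same select-and-filter principle.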
