Abstract

Cross-modal retrieval aims to retrieve relevant content in one modality given a query from another modality. The key difficulty is bridging the heterogeneity gap between modalities. Commonly used methods tend to exploit individual image-text pairs and mine the cross-modal relations within them, but they ignore multi-sample correlations, leaving the more global, structural inter-pair knowledge contained in the training dataset under-used. To fully exploit graph-structured semantics and mine the semantic information in the dataset for learning discriminative representations, we propose the Weighted Graph-structured Semantics Constraint Network (WGSCN), a unified, graph-based, semantically constrained learning framework in which graph convolutional networks (GCNs) mine comprehensive relation information from cross-modal data. Our main idea is a novel two-branch GCN-based Cross-modal Semantic Encoding (GCSE) module that produces semantic embeddings capturing both modality-specific and modality-shared correlations. In addition, a GAN-based Dual Learning (GDL) module further improves discriminability and models the joint distribution across modalities: it uses the semantic embeddings as supervisory signals to make the common representation semantically discriminative, while adversarial learning and dual learning make the common representation modality-invariant. Comparative experiments on five widely used cross-modal datasets demonstrate the superior retrieval accuracy of WGSCN.
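
The sketch below is not the authors' implementation; it is a minimal illustration of the two-branch idea described above, in which each modality has its own GCN branch (modality-specific correlation) and both branches share a final GCN layer (modality-shared correlation). All class names, dimensions, and the adjacency construction are illustrative assumptions.

```python
# Hypothetical sketch of a two-branch GCN semantic encoder in the spirit of GCSE.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W), with A_hat a normalized adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        return F.relu(a_hat @ self.linear(h))


class TwoBranchSemanticEncoder(nn.Module):
    """Two modality-specific GCN branches followed by a weight-shared GCN layer."""
    def __init__(self, img_dim, txt_dim, hid_dim, emb_dim):
        super().__init__()
        self.img_gcn = GCNLayer(img_dim, hid_dim)     # modality-specific (image)
        self.txt_gcn = GCNLayer(txt_dim, hid_dim)     # modality-specific (text)
        self.shared_gcn = GCNLayer(hid_dim, emb_dim)  # shared across modalities

    def forward(self, img_feat, txt_feat, a_img, a_txt, a_shared):
        z_img = self.shared_gcn(self.img_gcn(img_feat, a_img), a_shared)
        z_txt = self.shared_gcn(self.txt_gcn(txt_feat, a_txt), a_shared)
        return z_img, z_txt  # semantic embeddings usable as supervisory signals


def normalize_adjacency(a):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a dense adjacency matrix."""
    a = a + torch.eye(a.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)


if __name__ == "__main__":
    n = 8                                    # toy batch of 8 image-text pairs
    img, txt = torch.randn(n, 512), torch.randn(n, 300)
    # In practice the graph could encode label co-occurrence or feature similarity
    # between samples; a random adjacency is used here for illustration only.
    a = normalize_adjacency((torch.rand(n, n) > 0.5).float())
    enc = TwoBranchSemanticEncoder(512, 300, 256, 128)
    z_i, z_t = enc(img, txt, a, a, a)
    print(z_i.shape, z_t.shape)              # torch.Size([8, 128]) torch.Size([8, 128])
```

In such a setup, the resulting embeddings z_i and z_t could then supervise a common-representation network, with adversarial and dual-learning objectives encouraging modality invariance, as outlined in the abstract.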
