Due to the inconsistency of feature representations across modalities, known as the "heterogeneous gap", correlating images and texts remains a persistent challenge. Existing studies on image–text retrieval (ITR) mainly emphasize inter-modal correlation learning by aligning instances or their patches across modalities. However, it is hard to break through the performance bottlenecks of ITR without strong support from intra-modal correlation learning. Unfortunately, few studies have sufficiently considered two critical tasks in intra-modal correlation learning: (1) perceiving intricate contextual information, and (2) reasoning about intrinsic semantic relationships. Therefore, in this paper, we propose the Context-guided Cross-modal Correlation Learning (CCCL) framework for ITR under a novel paradigm: "Perceive, Reason, and Align". First, in the "Perceive" stage, a context-guided mechanism based on self-attention and gating is proposed to fully discover contextual information within each modality while eliminating unnecessary interactions between local-level patches. Second, in the "Reason" stage, a graph convolutional network with a residual structure is used to uncover relationships among patches within each modality and draw reasonable inferences. Third, in the "Align" stage, to achieve precise inter-modal alignment, the complementarity between modalities is effectively mined and fused at both the global and local levels. Finally, to optimize the proposed CCCL framework, a hybrid loss is constructed by combining a cross-modal coherence term with a cross-modal alignment term. Our approach yields highly competitive results on two publicly available ITR datasets, Flickr30K and MS-COCO.
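To make the intra-modal "Perceive" and "Reason" stages concrete, the following is a minimal sketch of how gated self-attention over local patches and residual graph-convolutional reasoning could be wired together. It is not the authors' released implementation; the module names, dimensions, gate design, and similarity-based adjacency construction are all assumptions made for illustration.

```python
# Illustrative sketch only: gated self-attention ("Perceive") followed by a
# residual graph convolution ("Reason") over intra-modal patch features.
# Module names, dimensions, and the adjacency construction are assumptions,
# not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSelfAttention(nn.Module):
    """Self-attention over patch/word features with a learned per-patch gate
    that suppresses unnecessary local-level interactions (hypothetical design)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        context, _ = self.attn(x, x, x)
        g = self.gate(torch.cat([x, context], dim=-1))  # gate values in [0, 1]
        return x + g * context                          # keep only useful context


class ResidualGCNLayer(nn.Module):
    """One graph-convolution step with a residual connection for reasoning
    over relationships among patches within a modality (hypothetical design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Soft adjacency from feature similarity (an assumption; the paper
        # may construct the intra-modal graph differently).
        adj = F.softmax(x @ x.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return x + F.relu(self.proj(adj @ x))           # residual aggregation


if __name__ == "__main__":
    feats = torch.randn(2, 36, 256)                     # e.g. 36 region features
    perceived = GatedSelfAttention(256)(feats)          # "Perceive" stage
    reasoned = ResidualGCNLayer(256)(perceived)         # "Reason" stage
    print(reasoned.shape)                               # torch.Size([2, 36, 256])
```

In this sketch the gate is computed from the concatenation of each patch feature and its attended context, so patches whose attended context is unhelpful are effectively passed through unchanged; the residual connections in both modules keep the original patch features intact while enriching them with contextual and relational information.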