Image-text retrieval, a fundamental cross-modal task, performs similarity reasoning between images and texts. The primary challenge for image-text retrieval is cross-modal semantic heterogeneity: the semantic features of the visual and textual modalities are rich but distinct. A scene graph is an effective representation for both images and texts because it explicitly models objects and their relations. However, existing scene-graph-based methods do not fully exploit the features at the various granularities implicit in a scene graph (e.g., triplets), and this inadequate feature matching loses non-trivial semantic information (e.g., the inner relations among triplets). Therefore, we propose a Semantic-Consistency Enhanced Multi-Level Scene Graph Matching (SEMScene) network, which exploits the semantic relevance between visual and textual scene graphs from fine-grained to coarse-grained levels. Firstly, under the scene graph representation, we perform feature matching at three levels: low-level node matching, mid-level semantic triplet matching, and high-level holistic scene graph matching. Secondly, to enhance the semantic consistency of object-fused triplets, which carry key correlation information, we propose a dual-step constraint mechanism in mid-level matching. Thirdly, to guide the model to learn the semantic consistency of matched image-text pairs, we devise effective loss functions for each stage of the dual-step constraint. Comprehensive experiments on the Flickr30K and MS-COCO datasets demonstrate that SEMScene achieves state-of-the-art performance with significant improvements.