Abstract

Image-text matching of natural scenes has been a popular research topic in both the computer vision and natural language processing communities. Recently, fine-grained image-text matching has made significant advances in inferring high-level semantic correspondence by aggregating pairwise region-word similarities, but it remains challenging, mainly due to the insufficient representation of high-order semantic concepts and their explicit connections within one modality when matched against the other modality. To tackle this issue, we propose a relationship-enhanced semantic graph (ReSG) model, which improves image-text representations by learning their locally discriminative semantic concepts and then organizing their relationships in a contextual order. To be specific, two tailored graph encoders, a visual relationship-enhanced graph (VReG) and a textual relationship-enhanced graph (TReG), are exploited to encode the high-level semantic concepts of the corresponding instances and their semantic relationships. Meanwhile, the representation of each graph node is optimized by aggregating semantically contextual information to enhance node-level semantic correspondence. Further, a hard-negative triplet ranking loss, a center hinge loss, and a positive-negative margin loss are jointly leveraged to learn the fine-grained correspondence between the ReSG representations of image and text, whereby discriminative cross-modal embeddings can be explicitly obtained to benefit various image-text matching tasks in a more interpretable way. Extensive experiments verify the advantages of the proposed fine-grained graph matching approach, which achieves state-of-the-art image-text matching results on public benchmark datasets.
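The abstract names three training objectives but does not give their formulations. As a rough illustration, the hard-negative triplet ranking loss is commonly implemented in the max-of-hinges form popularized by VSE++; the sketch below follows that convention and is not taken from the paper itself. The function name, the margin value, and the assumption of L2-normalized batch embeddings are all illustrative.

```python
import torch

def hard_negative_triplet_loss(im, s, margin=0.2):
    """Max-of-hinges triplet ranking loss over a batch of image
    embeddings `im` and text embeddings `s` (both assumed L2-normalized,
    shape [batch, dim]); row i of each tensor is a matched pair."""
    scores = im @ s.t()                       # cosine similarity matrix
    diag = scores.diag().view(-1, 1)          # positive-pair scores

    # hinge of every in-batch negative against its positive pair
    cost_s = (margin + scores - diag).clamp(min=0)       # image anchor vs. captions
    cost_im = (margin + scores - diag.t()).clamp(min=0)  # caption anchor vs. images

    # exclude the positive pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)

    # keep only the hardest negative per anchor, then average over the batch
    return cost_s.max(dim=1)[0].mean() + cost_im.max(dim=0)[0].mean()
```

In a joint objective of the kind the abstract describes, this term would presumably be summed with the center hinge and positive-negative margin losses under tuned weighting coefficients; their exact forms are not specified here.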
