Abstract

Image-Text Matching (ITM) is a fundamental task that plays a key role in cross-modal understanding. It remains challenging because prior works focus mainly on learning fine-grained (i.e., word- and/or phrase-level) correspondence without considering syntactic correspondence. A sentence is not only a set of words or phrases but also a syntactic structure, consisting of a set of basic syntactic tuples (i.e., (attribute) object - predicate - (attribute) subject). Inspired by this, we propose Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency (CSCC) for image-text matching, which simultaneously explores multi-level cross-modal alignments at the conceptual and syntactical levels under a consistency constraint. Specifically, a conceptual-level cross-modal alignment is introduced to explore fine-grained correspondence, while a syntactical-level cross-modal alignment is proposed to explicitly learn a high-level syntactic similarity function. Moreover, an empirical cross-level consistent attention loss is introduced to maintain consistency between the cross-modal attentions obtained from the two alignments. To validate our method, comprehensive experiments are conducted on two public benchmark datasets, MS-COCO (1K and 5K test sets) and Flickr30K, showing that CSCC outperforms state-of-the-art methods with competitive improvements.
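The abstract does not specify the form of the cross-level consistent attention loss. As a minimal illustrative sketch only (assuming, hypothetically, that both the conceptual-level and syntactical-level alignments produce attention distributions over the same set of image regions), one common way to penalize disagreement between two attention maps is a symmetric KL divergence:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax to turn raw scores into attention weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_level_consistency_loss(concept_attn, syntax_attn, eps=1e-8):
    """Hypothetical consistency penalty between two cross-modal attention maps.

    concept_attn, syntax_attn: arrays of shape (num_queries, num_regions),
    each row a probability distribution over image regions. Returns the mean
    symmetric KL divergence; 0 when the two attentions agree exactly.
    """
    p = concept_attn + eps
    q = syntax_attn + eps
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))

# Example: attention from 4 textual units over 6 image regions.
rng = np.random.default_rng(0)
a = softmax(rng.normal(size=(4, 6)))
b = softmax(rng.normal(size=(4, 6)))
print(cross_level_consistency_loss(a, a))  # identical attentions -> 0
print(cross_level_consistency_loss(a, b))  # disagreement -> positive penalty
```

This is not the paper's actual loss; the function name, the shared-region assumption, and the choice of symmetric KL are stand-ins for whatever empirical constraint CSCC uses.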


