Abstract

Image-Text Matching (ITM) is a fundamental task that plays a key role in cross-modal understanding. It remains challenging because prior works focus mainly on learning fine-grained (i.e., word- and/or phrase-level) correspondence without considering syntactic correspondence. A sentence is not only a set of words or phrases but also a syntactic structure, consisting of a set of basic syntactic tuples (i.e., (attribute) object - predicate - (attribute) subject). Inspired by this, we propose Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency (CSCC) for image-text matching, which simultaneously explores multi-level cross-modal alignments at the conceptual and syntactical levels under a consistency constraint. Specifically, a conceptual-level cross-modal alignment is introduced to explore fine-grained correspondence, while a syntactical-level cross-modal alignment is proposed to explicitly learn a high-level syntactic similarity function. Moreover, an empirical cross-level consistent attention loss is introduced to maintain consistency between the cross-modal attentions obtained from the two alignments. To validate our method, comprehensive experiments are conducted on two public benchmark datasets, MS-COCO (1K and 5K test sets) and Flickr30K, showing that CSCC outperforms state-of-the-art methods with competitive improvements.
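The abstract does not specify the form of the cross-level consistent attention loss. As a minimal illustrative sketch only (assuming, hypothetically, that both the conceptual-level and syntactical-level alignments produce attention distributions over the same set of image regions), one common way to penalize disagreement between two attention maps is a symmetric KL divergence:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax to turn raw scores into attention weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_level_consistency_loss(concept_attn, syntax_attn, eps=1e-8):
    """Hypothetical consistency penalty between two cross-modal attention maps.

    concept_attn, syntax_attn: arrays of shape (num_queries, num_regions),
    each row a probability distribution over image regions. Returns the mean
    symmetric KL divergence; 0 when the two attentions agree exactly.
    """
    p = concept_attn + eps
    q = syntax_attn + eps
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))

# Example: attention from 4 textual units over 6 image regions.
rng = np.random.default_rng(0)
a = softmax(rng.normal(size=(4, 6)))
b = softmax(rng.normal(size=(4, 6)))
print(cross_level_consistency_loss(a, a))  # identical attentions -> 0
print(cross_level_consistency_loss(a, b))  # disagreement -> positive penalty
```

This is not the paper's actual loss; the function name, the shared-region assumption, and the choice of symmetric KL are stand-ins for whatever empirical constraint CSCC uses.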


