Abstract

Image-text matching plays an important role in cross-modal information processing. Because heterogeneous pairwise data exhibit nonnegligible semantic differences, a crucial challenge is learning a unified representation. Existing methods mainly rely on aligning regional image features with corresponding entity words. However, regional image features tend to capture foreground entity information, while the attributes of entities and the relations between them are ignored, and how to effectively integrate entity-attribute alignment with relationship alignment has not been fully studied. We therefore propose a Cross-Modal Semantically Augmented Network for Image-Text Matching (CMSAN), which combines the relationships between entities in the image with the semantics of relational words in the text. CMSAN (1) introduces an adaptive word-type prediction model that classifies words into four types, i.e., entity words, attribute words, relation words, and unnecessary words, so that different image features can be aligned at multiple levels, and (2) designs a relationship alignment module and an entity-attribute alignment module that fully exploit the semantic information, giving the model greater discriminative power and further improving matching accuracy.
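As a rough illustration of the word-type prediction idea described above, the sketch below shows a small classification head over contextual word embeddings that outputs soft probabilities for the four word types. The module name, layer sizes, input dimension, and the use of a softmax over four classes are assumptions made for illustration; this is not the authors' implementation.

```python
# Minimal sketch of a word-type prediction head, assuming contextual word
# embeddings (e.g., from a text encoder) of dimension d_model.
# All names and dimensions are illustrative assumptions, not CMSAN's code.
import torch
import torch.nn as nn

WORD_TYPES = ["entity", "attribute", "relation", "unnecessary"]

class WordTypePredictor(nn.Module):
    def __init__(self, d_model: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(WORD_TYPES)),
        )

    def forward(self, word_feats: torch.Tensor) -> torch.Tensor:
        # word_feats: (batch, num_words, d_model)
        # Returns soft type probabilities of shape (batch, num_words, 4),
        # which could gate entity-attribute vs. relationship alignment branches.
        return self.mlp(word_feats).softmax(dim=-1)

if __name__ == "__main__":
    words = torch.randn(2, 10, 512)       # dummy contextual word embeddings
    probs = WordTypePredictor()(words)    # (2, 10, 4)
    print(probs.shape, probs.sum(-1))     # probabilities sum to 1 per word
```

In such a design, the predicted type probabilities could weight each word's contribution to the corresponding alignment branch, which is one plausible way to realize the multi-level alignment the abstract describes.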
