Abstract

Image-text matching, which bridges vision and language, is a crucial task in multi-modal intelligence. The key challenge lies in accurately measuring image-text relevance as matching evidence. Most existing works aggregate the local semantic similarities of matched region-word pairs into an overall relevance score, typically assuming that all matched pairs are equally reliable. However, even when a region-word pair is locally matched across modalities, it may be inconsistent or unreliable from the global perspective of the whole image and text, leading to inaccurate relevance measurement. In this paper, we propose a novel Cross-Modal Confidence-Aware Network that infers a matching confidence indicating the reliability of each matched region-word pair, and combines this confidence with the local semantic similarities to refine the relevance measurement. Specifically, we first compute the matching confidence from the relevance between the semantics of an image region and the complete semantics described for the image, using the text as a bridge. To express region semantics more richly, we further extend each region to its visual context in the image. The local semantic similarities are then weighted by the inferred confidence during aggregation to filter out unreliable matched pairs. Comprehensive experiments show that our method achieves state-of-the-art performance on the Flickr30K and MSCOCO benchmarks.
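To make the confidence-weighted aggregation concrete, the sketch below shows one way per-pair confidence scores could reweight local region-word similarities before they are summed into an overall relevance score. It is an illustrative approximation under simple assumptions (precomputed region and word embeddings, one confidence score per region), not the paper's released implementation; the function and variable names (`image_text_relevance`, `confidence`, etc.) are hypothetical.

```python
# Hypothetical sketch of confidence-weighted aggregation of local
# region-word similarities; not the authors' implementation.
import torch
import torch.nn.functional as F


def image_text_relevance(regions, words, confidence):
    """Aggregate region-word similarities, weighted by per-region confidence.

    regions:    (R, D) region embeddings
    words:      (W, D) word embeddings
    confidence: (R,)   reliability score of each region's matched pair
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # Local semantic similarity of every region-word pair (cosine similarity).
    sim = regions @ words.t()                      # (R, W)

    # For each region, treat its best-matching word as the matched pair.
    local_sim, _ = sim.max(dim=1)                  # (R,)

    # Down-weight pairs that are locally matched but globally unreliable.
    weights = torch.softmax(confidence, dim=0)     # (R,)
    return (weights * local_sim).sum()             # scalar relevance score


# Toy usage with random features: 36 regions, 12 words, 256-d embeddings.
regions = torch.randn(36, 256)
words = torch.randn(12, 256)
confidence = torch.rand(36)
print(image_text_relevance(regions, words, confidence).item())
```

In this toy version the confidence vector is given directly; in the paper it is inferred from the relevance between region semantics (extended with visual context) and the complete described semantics of the image, with the text acting as a bridge.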
