Abstract

Image-text matching has become a research hotspot in recent years. The key to image-text matching is accurately measuring the similarity between an image and a sentence. However, most existing methods focus either on inter-modality similarities between image regions and sentence words or on intra-modality similarities within image regions or within words, and therefore cannot fully exploit the detailed correlations between images and texts. Furthermore, existing methods typically train their models with a triplet ranking loss, which relies on the similarity of randomly sampled triplets. Because the weights of positive and negative samples are not adjusted, this loss cannot provide enough gradient information for training, resulting in slow convergence and limited performance. To address these problems, we propose an image-text matching method named Bi-Attention Enhanced Representation Learning (BAERL). It builds a self-attention learning sub-network to exploit intra-modality correlations within image regions or words and a co-attention learning sub-network to exploit inter-modality correlations between image regions and words. The representations produced by the two sub-networks thus capture holistic correlations between images and texts. In addition, BAERL is trained with a self-similarity polynomial loss instead of the triplet ranking loss. The self-similarity polynomial loss adaptively assigns appropriate weights to different pairs based on their similarity scores, further improving retrieval performance. Experiments on two benchmark datasets demonstrate the superior performance of the proposed BAERL method over several state-of-the-art methods.
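
To make the weighting idea concrete, below is a minimal PyTorch sketch of a self-similarity weighted polynomial loss. It is an illustration under stated assumptions, not the authors' exact formulation: the function name, the polynomial coefficients (a, b, q), and the margin are hypothetical choices; only the idea that each pair's weight is derived from its own similarity score is taken from the abstract.

```python
# Hypothetical sketch of a self-similarity weighted polynomial matching loss.
# NOT the authors' exact loss: coefficients a, b, q and the margin are
# illustrative assumptions chosen only to show adaptive pair weighting.
import torch


def self_similarity_polynomial_loss(sim, margin=0.2, a=2.0, b=40.0, q=2.0):
    """sim: (N, N) image-text similarity matrix; diagonal entries are the
    matched (positive) pairs, off-diagonal entries are mismatched (negative)."""
    n = sim.size(0)
    pos = sim.diag()                                  # similarities of positive pairs
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    neg = sim[mask].view(n, n - 1)                    # similarities of negative pairs

    # Each pair's weight is a polynomial function of its own similarity score:
    # hard positives (low similarity) and hard negatives (high similarity)
    # receive larger weights and thus contribute larger gradients.
    w_pos = a * (1.0 - pos).clamp(min=0.0) ** q       # weights for positive pairs
    w_neg = b * (neg - margin).clamp(min=0.0) ** q    # weights for negative pairs

    loss_pos = (w_pos * (1.0 - pos)).mean()           # pull positives toward 1
    loss_neg = (w_neg * neg).mean()                   # push negatives below margin
    return loss_pos + loss_neg
```

In this sketch the weight grows as a positive pair falls further below a perfect score and as a negative pair rises above the margin, so hard pairs dominate the gradient; this is the adaptive behavior that a triplet ranking loss with fixed, unadjusted sample weights lacks.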
