Abstract

Image–text retrieval is a vital task in computer vision and has received growing attention, since it connects data across modalities. It poses the critical challenges of learning unified representations and bridging the large gap between the visual and textual domains. Although many works have made significant progress in image–text retrieval over the past few decades, they still face the challenge of incomplete text descriptions of images, i.e., how to fully learn the correlations between relevant region–word pairs under semantic diversity. In this article, we propose a novel semantic completion and filtration (SCAF) method to alleviate this issue. Specifically, a text semantic completion module is presented to generate a complete semantic description of an image from multi-view text descriptions, guiding the model to fully explore the correlations of relevant region–word pairs. Meanwhile, an adaptive structural semantic matching module is presented to filter out irrelevant region–word pairs according to the relevance score of each region–word pair, which helps the model focus on learning the relevance of matching pairs. Extensive experiments show that SCAF outperforms existing methods on the Flickr30K and MSCOCO datasets, demonstrating the superiority of the proposed method.
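To make the filtration idea concrete, the sketch below illustrates relevance-based filtering of region–word pairs as described above. This is not the authors' implementation: the cosine-similarity scoring, the threshold `tau`, and the mean aggregation over surviving pairs are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of filtering
# irrelevant region-word pairs by a pairwise relevance score.
import torch
import torch.nn.functional as F

def filtered_alignment(regions: torch.Tensor, words: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """regions: (R, d) image-region features; words: (W, d) word features.

    Scores every region-word pair by cosine similarity, discards pairs
    whose score falls below the threshold tau, and returns an
    image-sentence similarity averaged over the surviving pairs.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    scores = regions @ words.t()        # (R, W) pairwise relevance scores
    mask = scores > tau                 # keep only sufficiently relevant pairs
    kept = scores * mask
    # Average retained scores; clamp avoids division by zero if none survive.
    return kept.sum() / mask.sum().clamp(min=1)

# Toy usage: 4 regions and 6 words with 256-d features.
sim = filtered_alignment(torch.randn(4, 256), torch.randn(6, 256))
print(sim)
```

In this reading, the threshold acts as the "filtration" step: only pairs judged relevant contribute to the image–sentence similarity, so the loss gradient concentrates on matching pairs rather than on noise from unrelated regions and words.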
