Abstract

Cross-media retrieval aims to discover the relationships between samples of different modalities, so that samples of one modality can be used to retrieve semantically similar samples of another. Existing cross-media retrieval methods exploit only part of the available image and text information: they either match the whole image against the whole sentence, or match individual image regions against individual words. To make better use of the combined features of images and text, this paper proposes a cross-media image-text retrieval method that fuses two levels of similarity to find better matches between image and text semantics. Specifically, the image is decomposed into the whole picture and a set of image regions, and the text into the whole sentence and a set of words; each granularity is modeled separately to explore the full latent alignment between images and text. A two-level alignment framework then lets the two granularities reinforce each other, and fusing the two similarities yields a more complete representation for cross-media retrieval. Experimental results on the Flickr30K and MS-COCO datasets show that this model achieves a higher recall than many state-of-the-art cross-media retrieval models.
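
As a rough illustration of the fused scoring scheme the abstract describes, the sketch below (Python with PyTorch) combines a global image-sentence similarity with a local region-word similarity. The function names, the max-over-regions aggregation, and the fusion weight alpha are assumptions made for illustration; they are not the paper's exact formulation.

import torch
import torch.nn.functional as F

def global_similarity(img_emb, sent_emb):
    # Whole-image vs. whole-sentence cosine similarity.
    # img_emb, sent_emb: (d,) embedding vectors.
    return F.cosine_similarity(img_emb, sent_emb, dim=-1)

def local_similarity(region_embs, word_embs):
    # region_embs: (n_regions, d); word_embs: (n_words, d).
    # Pairwise cosine similarity between every region and every word.
    sim = F.cosine_similarity(region_embs.unsqueeze(1),
                              word_embs.unsqueeze(0), dim=-1)
    # For each word, keep its best-matching region, then average over
    # words (one plausible aggregation; the paper may use another).
    return sim.max(dim=0).values.mean()

def fused_similarity(img_emb, sent_emb, region_embs, word_embs, alpha=0.5):
    # alpha is a hypothetical weight balancing the two similarity levels.
    return (alpha * global_similarity(img_emb, sent_emb)
            + (1 - alpha) * local_similarity(region_embs, word_embs))

At retrieval time, candidates of the other modality would be ranked by fused_similarity, so that both the global and the region-word granularities contribute to the final score.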

