Abstract
Given a text or image query, image-text retrieval aims to find the relevant items in a database. Recently, visual-linguistic pre-training (VLP) methods have demonstrated promising accuracy on image-text retrieval and other visual-linguistic tasks. These VLP methods are typically pre-trained on large amounts of image-text pairs and then fine-tuned on various downstream tasks. Nevertheless, due to the natural modality incompleteness in image-text retrieval, i.e., the query is either an image or a text rather than an image-text pair, naively applying VLP to image-text retrieval results in significant inefficiency. Moreover, existing VLP methods cannot extract comparable representations for a single-modal query and multi-modal database items. In this work, we propose a generative visual-linguistic pre-training approach, termed GilBERT, to simultaneously learn generic representations of image-text data and complete the missing modality for incomplete pairs. In the testing phase, the proposed GilBERT facilitates efficient vector-based retrieval by providing unified feature embeddings for the query and database items. Moreover, the generative training not only makes GilBERT compatible with non-parallel text/image corpora, but also enables GilBERT to model image-text relationships without relying on massive numbers of randomly sampled negatives, leading to superior experimental performance. Extensive experiments demonstrate the advantages of GilBERT in image-text retrieval, in terms of both efficiency and accuracy.
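To illustrate the efficiency argument, below is a minimal sketch (not the authors' implementation) of why unified feature embeddings enable vector-based retrieval: database items are embedded once offline, and answering a query reduces to a single similarity search, whereas a cross-attention VLP scorer must run the full model on every (query, item) pair at query time. The embedding dimension, database size, and the idea of standing in for encoder outputs with random vectors are all illustrative assumptions.

```python
# Minimal sketch of vector-based retrieval with unified embeddings.
# The random vectors below are hypothetical stand-ins for the outputs
# of a model (such as GilBERT) that maps single-modal queries and
# multi-modal database items into the same embedding space.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Normalize vectors so that dot products equal cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Database embeddings are computed offline, once, for every image-text item.
db_embeddings = l2_normalize(np.random.randn(10_000, 512))

def retrieve(query_embedding, db_embeddings, top_k=5):
    """Rank database items by cosine similarity to a single query vector."""
    query_embedding = l2_normalize(query_embedding)
    scores = db_embeddings @ query_embedding   # one matrix-vector product
    return np.argsort(-scores)[:top_k]         # indices of the best matches

# At test time, only the (text or image) query needs to be encoded; no
# per-pair forward pass over the database is required.
query = np.random.randn(512)                   # stand-in for an encoded text/image query
print(retrieve(query, db_embeddings))
```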