Abstract

Vision-language pre-training (VLP) has substantially advanced performance on many vision-language tasks. However, most existing pre-trained models excel at either understanding-based tasks or generation-based tasks, but rarely both. Furthermore, performance gains have largely come from scaling up training data with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. BLIP is a VLP framework that transfers flexibly to both vision-language understanding and generation tasks; it makes effective use of noisy web data by bootstrapping captions, where a captioner generates synthetic captions and a filter removes the noisy ones. To meet the practical need of existing search engines for both retrieval speed and retrieval accuracy, this paper proposes an improved retrieval method based on the BLIP algorithm. We migrate BLIP's image-text retrieval strategy from image-text contrastive (ITC) scoring to image-text matching (ITM) scoring, and strengthen the model's ability to discriminate positive from negative samples with a hard negative mining strategy, further improving retrieval accuracy.
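To make the ITC-to-ITM migration concrete, the sketch below shows a common two-stage retrieval pattern: a cheap ITC similarity over projected features shortlists candidates, and the cross-attention ITM head then re-ranks the shortlist so that the final ranking is decided by the matching score rather than the contrastive similarity. This is only a minimal illustration under assumed interfaces; `model.itm_score`, the tensor shapes, and the variable names are hypothetical and not taken from the paper or the BLIP codebase.

```python
import torch

def retrieve_images(text_feat, text_inputs, image_feats, image_embeds, model, k=128):
    """Two-stage text-to-image retrieval sketch (assumed interfaces).

    text_feat:    (D,)      projected text feature used for ITC similarity
    text_inputs:  tokenized text passed to the ITM head (format assumed)
    image_feats:  (N, D)    projected image features of the gallery (ITC space)
    image_embeds: (N, L, H) full patch embeddings consumed by the ITM head
    model:        object exposing a hypothetical `itm_score(image_embed, text_inputs)`
                  method that returns a scalar match probability for one pair
    """
    # Stage 1: cheap ITC shortlist via dot product in the shared embedding space.
    itc_sim = image_feats @ text_feat                # (N,)
    _, topk_idx = itc_sim.topk(k)                    # indices of the k best ITC candidates

    # Stage 2: expensive ITM scoring of the shortlist with cross-attention.
    itm_scores = torch.stack([
        model.itm_score(image_embeds[i], text_inputs) for i in topk_idx
    ])                                               # (k,)

    # Final ranking is decided by the ITM matching score, not the ITC similarity.
    order = itm_scores.argsort(descending=True)
    return topk_idx[order], itm_scores[order]
```

At training time, the same ITC similarities can be reused to sample in-batch hard negatives (pairs with high contrastive similarity but no ground-truth match) for the ITM loss, which is one way to realize the hard negative mining strategy mentioned above.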
