Abstract

Image-Text Matching (ITM) aims to establish the correspondence between images and sentences and is fundamental to a wide range of vision-and-language understanding tasks. However, existing ITM benchmarks have a limitation rooted in how they are constructed: pairs of images and sentences are collected together, so only the samples paired at collection time are annotated as positive, while all other samples are annotated as negative. Many genuine correspondences among the negatively annotated samples are therefore missed. For example, a sentence is paired with exactly one image at collection time, and only that image is annotated as positive for the sentence; yet the images annotated as negative may include ones that do correspond to the sentence. Such mislabeled samples are called "false negatives". Because existing ITM models are optimized on annotations containing these mislabels, noise is introduced during training. In this paper, we propose an ITM framework integrating Language Guidance (LG) to correct false negatives. A language pre-training model is introduced into the ITM framework to identify false negatives, and we propose a language guidance loss that adaptively corrects the locations of false negatives in the visual-semantic embedding space. Extensive experiments on two ITM benchmarks show that our method improves the performance of existing ITM models. To verify the quality of false-negative correction, we conduct further experiments on ECCV Caption, a verified dataset in which false negatives in the annotations have been corrected. The results show that our method recalls more relevant false negatives.
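The core idea described above can be sketched as follows. This is a minimal, hypothetical illustration (the function name `lg_triplet_loss` and the threshold parameter `tau` are our own, not from the paper): a standard hinge-based triplet ranking loss over an image-text similarity matrix, where a sentence-sentence similarity matrix from a pre-trained language model flags likely false negatives, which are then excluded from the negative set instead of being pushed apart. The actual LG loss adaptively corrects the embedding locations of false negatives rather than simply masking them.

```python
import numpy as np

def lg_triplet_loss(sim, text_sim, margin=0.2, tau=0.8):
    """Sketch of a language-guided triplet ranking loss.

    sim      : (N, N) image-text similarity matrix; the diagonal holds
               the annotated positive pairs.
    text_sim : (N, N) sentence-sentence similarity matrix produced by a
               pre-trained language model.
    tau      : hypothetical threshold above which an off-diagonal pair
               is treated as a likely false negative.
    """
    n = sim.shape[0]
    pos = np.diag(sim)  # scores of the annotated positive pairs
    # Mask out the annotated positives and the likely false negatives
    # (pairs whose captions the language model judges highly similar).
    mask = np.eye(n, dtype=bool) | (text_sim > tau)
    # Hinge loss over the remaining (presumed true) negatives,
    # image-to-text retrieval direction.
    cost = np.maximum(0.0, margin + sim - pos[:, None])
    cost[mask] = 0.0
    return cost.sum() / n
```

With `tau` set high enough that nothing is flagged, this reduces to the ordinary triplet loss; lowering `tau` removes the training signal that would otherwise push likely false negatives away from their matching sentences.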
The code is available at https://github.com/AAA-Zheng/LG_ITM.
