Vision-Language Pre-training (VLP) has shown promising performance on various tasks by learning a generic image-text representation space. However, most existing VLP methods suffer from the Noisy Correspondence (NC) problem, which refers to wrongly matched image-text pairs harvested from the wild. In this paper, we empirically study the influence of NC on VLP models and make two observations. First, NC substantially degrades performance on downstream tasks even after fine-tuning, indicating that NC must be handled during pre-training. Second, the influence of NC varies across pre-training objectives, suggesting that NC robustness requires objective-specific solutions. Based on these observations, we propose a novel NoisE-robust Vision-languagE pRe-training method (NEVER) to endow VLP models with robustness against NC. In brief, NEVER first divides the training data into clean and noisy subsets in a progressive and adaptive manner. NEVER then applies positive learning (PL) and negative learning (NL) to the two subsets to achieve model convergence and noise robustness, respectively. To further handle false negatives in PL and NL, NEVER smooths and sharpens the training targets using the predictions of a twin momentum model. Extensive experiments on various vision-and-language (V+L) tasks verify the effectiveness of the proposed method. The code will be released upon acceptance.
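
One possible reading of the split-then-PL/NL idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fixed loss threshold is a stand-in for the progressive, adaptive division (which the abstract does not specify), applying PL to the clean subset and NL to the noisy subset is an assumption, the momentum-model target adjustment is omitted, and all function names and similarity values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def positive_loss(sim_row, paired_idx):
    """Positive learning: cross-entropy that pulls the annotated pair together."""
    p = softmax(sim_row)
    return -math.log(p[paired_idx])

def negative_loss(sim_row, paired_idx, eps=1e-12):
    """Negative learning: treat the annotated pair as a likely mismatch and
    push its probability down instead of up."""
    p = softmax(sim_row)
    return -math.log(max(1.0 - p[paired_idx], eps))

def split_by_loss(losses, threshold):
    """Assumed stand-in for the adaptive division: small-loss pairs are
    treated as clean, large-loss pairs as noisy."""
    clean = [i for i, l in enumerate(losses) if l <= threshold]
    noisy = [i for i, l in enumerate(losses) if l > threshold]
    return clean, noisy

# Hypothetical image-to-text similarities; row i is annotated as paired with text i.
# Row 1 is most similar to text 2, so its annotated pair looks mismatched.
sims = [
    [5.0, 0.1, 0.2],
    [0.3, 0.2, 4.0],
    [0.1, 0.2, 6.0],
]
paired = [0, 1, 2]

per_sample = [positive_loss(row, j) for row, j in zip(sims, paired)]
clean, noisy = split_by_loss(per_sample, threshold=1.0)

# PL on the (assumed) clean subset, NL on the (assumed) noisy subset.
total_loss = sum(positive_loss(sims[i], paired[i]) for i in clean)
total_loss += sum(negative_loss(sims[i], paired[i]) for i in noisy)
```

In this sketch, a suspected mismatch contributes a small negative-learning penalty rather than a large cross-entropy term, so it no longer dominates the gradient toward a wrong alignment.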