Abstract

Pre-trained language models (PLMs), such as BERT, have been successfully employed in the two-phase ranking pipeline for information retrieval (IR). Meanwhile, recent studies have reported that the BERT model is vulnerable to imperceptible textual perturbations on quite a few natural language processing (NLP) tasks. In IR, the established BERT re-ranker is mainly trained on large-scale, relatively clean datasets such as MS MARCO, whereas noisy text is far more common in real-world scenarios such as web search. Moreover, the impact of within-document textual noise (perturbations) on retrieval effectiveness remains largely uninvestigated, especially its effect on the ranking quality of the BERT re-ranker, given the model's contextualized nature. To close this gap, we carry out exploratory experiments on the MS MARCO dataset to examine whether the BERT re-ranker can still perform well when ranking noisy text. Unfortunately, we observe non-negligible effectiveness degradation of the BERT re-ranker across ten different types of synthetic within-document textual noise. To address these effectiveness losses, we propose a novel noise-tolerant model, De-Ranker, which is trained by minimizing the distance between noisy text and its original clean version. Our evaluation on the MS MARCO and TREC 2019–2020 DL datasets demonstrates that De-Ranker handles synthetic textual noise more effectively, with a 3%–4% performance improvement over the vanilla BERT re-ranker. Meanwhile, extensive zero-shot transfer experiments on a total of 18 widely-used IR datasets show that De-Ranker can not only tackle natural noise in real-world text, but also achieve a 1.32% improvement on average in cross-domain generalization on the BEIR benchmark.
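
The abstract does not give implementation details of the De-Ranker objective; as an illustration only, the following minimal PyTorch sketch shows one way a "minimize the distance between noisy text and its clean original" term could look. All names here (cls_embedding, denoising_loss, the choice of MSE over [CLS] vectors, the bert-base-uncased checkpoint) are assumptions for the sketch, not the paper's actual method.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Hypothetical sketch of a noise-tolerant auxiliary loss: align the
# re-ranker's representation of a noisy passage with that of its clean
# original. The paper's actual De-Ranker objective may differ.

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(texts):
    """Encode a batch of texts and return their [CLS] vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] token

def denoising_loss(clean_texts, noisy_texts):
    """Mean-squared distance between noisy and clean representations."""
    clean = cls_embedding(clean_texts).detach()  # clean side as a fixed target
    noisy = cls_embedding(noisy_texts)
    return F.mse_loss(noisy, clean)

# Usage (hypothetical): add this term to the usual ranking loss when
# fine-tuning the re-ranker, e.g.
#   loss = ranking_loss + lambda_ * denoising_loss(clean_batch, noisy_batch)
```

One natural design choice this sketch assumes is detaching the clean-side embedding, so the gradient only pushes the noisy representation toward the clean one rather than collapsing both.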
