Abstract
Information Retrieval (IR) is a vital component of Open-Domain Question Answering (ODQA) systems: it searches large passage collections, such as Wikipedia, for information relevant to a question and passes the results to a reader for subsequent processing. Historically, information retrieval relied on textual overlap to find relevant context, using methods such as BM25 and TF-IDF, which lack natural language understanding. The advent of deep learning led to the introduction of Dense Passage Retrievers (DPR), which outperform traditional sparse retrievers. These dense retrievers use Pre-trained Language Models (PLMs) to initialize context encoders, enabling the extraction of natural language representations, and use the distance between latent vectors of contexts as a measure of similarity. However, DPR methods rely heavily on large volumes of meticulously labeled data, such as Natural Questions, and data labeling is both costly and time-intensive. In this paper, we propose SDA (Self Data Augmentation), a novel data augmentation method that employs DPR models to automatically annotate unanswered questions. Specifically, we first retrieve relevant pseudo passages for these unlabeled questions. We then introduce three distinct passage selection methods to annotate these pseudo passages. Finally, we combine the pseudo-labeled passages with the unanswered questions to create augmented data. Our experiments on two large datasets (Natural Questions and TriviaQA) and a relatively small dataset (WebQuestions), using three different base models, show that adding the newly augmented data yields significant improvements. Moreover, our proposed data augmentation method is flexible and readily adapts to various dense retrievers.
Additionally, we have conducted a comprehensive human study on the augmented data, which further supports our conclusions.
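As an illustration only, the retrieve, select, and combine steps described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the embeddings are hand-written toy vectors rather than PLM encoder outputs, inner product stands in for the similarity measure, and `select_top1` is a hypothetical placeholder for the three passage selection methods, which the abstract does not specify. All function and variable names are invented for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Scored = List[Tuple[str, float]]


def dot(u: Vector, v: Vector) -> float:
    """Inner product, assumed here as the similarity between latent vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(q_vec: Vector, passage_vecs: Dict[str, Vector], k: int) -> Scored:
    """Return the k (passage id, score) pairs most similar to the question."""
    scored = [(pid, dot(q_vec, p_vec)) for pid, p_vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def select_top1(candidates: Scored) -> List[str]:
    """Hypothetical selection rule: keep only the best-scoring passage."""
    return [candidates[0][0]] if candidates else []


def self_augment(
    question_vecs: Dict[str, Vector],
    passage_vecs: Dict[str, Vector],
    select: Callable[[Scored], List[str]] = select_top1,
    k: int = 2,
) -> List[Tuple[str, str]]:
    """Pair each unlabeled question with its selected pseudo passages."""
    augmented = []
    for question, q_vec in question_vecs.items():
        candidates = retrieve(q_vec, passage_vecs, k)  # pseudo passages
        for pid in select(candidates):                 # pseudo labels
            augmented.append((question, pid))
    return augmented


# Toy 2-dimensional embeddings (assumed values, not real encoder outputs).
passages = {"p_paris": [1.0, 0.0], "p_rome": [0.0, 1.0], "p_mixed": [0.6, 0.6]}
questions = {"capital of France?": [0.9, 0.1], "capital of Italy?": [0.2, 0.8]}

pairs = self_augment(questions, passages)
print(pairs)
```

In this sketch the resulting question and passage pairs would be added to the labeled training set before retraining the retriever; a different selection rule can be swapped in through the `select` argument.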