Self Data Augmentation for Open Domain Question Answering

Qin Zhang, Mengqi Zheng, Shangsi Chen, Han Liu, Meng Fang

doi:10.1145/3707449

Abstract

Information Retrieval (IR) is a vital component of Open-Domain Question Answering (ODQA) systems: it searches extensive collections of passages, such as Wikipedia, for pertinent information to pass to a downstream reader. Historically, information retrieval relied on textual overlap to find relevant context, employing methods such as BM25 and TF-IDF, which lack natural language understanding. The advent of deep learning led to the introduction of Dense Passage Retrievers (DPR), which show superiority over traditional sparse retrievers. These dense retrievers leverage Pre-trained Language Models (PLMs) to initialize context encoders, enabling the extraction of natural language representations, and they use the distance between latent vectors of contexts as a metric for assessing similarity. However, DPR methods rely heavily on large volumes of meticulously labeled data, such as Natural Questions, and data labeling is both costly and time-intensive. In this paper, we propose a novel data augmentation methodology, SDA (Self Data Augmentation), that employs DPR models to automatically annotate unanswered questions. Specifically, we first retrieve relevant pseudo passages for these unlabeled questions. We then introduce three distinct passage selection methods to annotate these pseudo passages. Finally, we combine the pseudo-labeled passages with the unanswered questions to create augmented data. Our experimental evaluations, conducted on two extensive datasets (Natural Questions and TriviaQA) and a relatively small dataset (WebQuestions) with three diverse base models, illustrate the significant enhancement achieved by incorporating the freshly augmented data. Moreover, our proposed data augmentation method is flexible and readily adaptable to various dense retrievers. Additionally, we have conducted a comprehensive human study on the augmented data, which further supports our conclusions.
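
The three steps named in the abstract (retrieve pseudo passages, select among them, pair the selected passages with the unlabeled questions) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy two-dimensional vectors stand in for PLM encoder outputs, inner product stands in for the similarity metric, and `select_top1` with its `min_score` threshold is a hypothetical selection rule, not one of the paper's three selection methods, which the abstract does not specify.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, used here as an assumed similarity score."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(question_vec: Vector,
             passage_vecs: Dict[str, Vector],
             k: int) -> List[Tuple[str, float]]:
    """Return the k passages with the highest similarity to the question."""
    scored = [(pid, dot(question_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def select_top1(candidates: List[Tuple[str, float]],
                min_score: float) -> List[str]:
    """Hypothetical selection rule: keep the best candidate if it clears a threshold."""
    if candidates and candidates[0][1] >= min_score:
        return [candidates[0][0]]
    return []


def augment(question_vecs: Dict[str, Vector],
            passage_vecs: Dict[str, Vector],
            k: int = 2,
            min_score: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each unlabeled question with its selected pseudo-labeled passages."""
    augmented = []
    for qid, qvec in question_vecs.items():
        candidates = retrieve(qvec, passage_vecs, k)
        for pid in select_top1(candidates, min_score):
            augmented.append((qid, pid))
    return augmented


# Toy vectors standing in for encoder outputs (illustrative values only).
passages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.6, 0.6]}
questions = {"q1": [0.9, 0.1], "q2": [0.1, 0.2]}

pairs = augment(questions, passages)
print(pairs)
```

In this sketch a question whose best candidate scores below the threshold contributes no augmented pair, which is one simple way to avoid adding low-confidence pseudo labels to the training data.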

Similar Papers

More From: ACM Transactions on Information Systems

PTF-FSR: A Parameter Transmission-Free Federated Sequential Recommender System
Wei Yuan ... Hongzhi Yin
12 Dec 2024
ACM Transactions on Information Systems

Self Data Augmentation for Open Domain Question Answering
Qin Zhang ... Meng Fang
10 Dec 2024
ACM Transactions on Information Systems

Privacy-Preserving Sequential Recommendation with Collaborative Confusion
Wei Wang ... Yujun Li
06 Dec 2024
ACM Transactions on Information Systems

Efficient and Adaptive Recommendation Unlearning: A Guided Filtering Framework to Erase Outdated Preferences
Yizhou Dang ... Xingwei Wang
05 Dec 2024
ACM Transactions on Information Systems
