ViMRC - VLSP 2021: Using XLM-RoBERTa and Filter Output for Vietnamese Machine Reading Comprehension

Minh Lê Nguyễn, Văn Nhân Đặng

doi:10.25073/2588-1086/vnucsce.336

Abstract

Machine Reading Comprehension (MRC) has recently made significant progress. This paper reports on our participation in the Vietnamese Machine Reading Comprehension task at the 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021), for which we built an MRC system specifically for Vietnamese. Based on SQuAD2.0, the organizing committee developed the Vietnamese Question Answering Dataset UIT-ViQuAD2.0, a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Vietnamese Wikipedia articles. UIT-ViQuAD2.0 evolved from version 1.0; the difference is that version 2.0 contains both answerable and unanswerable questions. The answer to an answerable question is a span of text from the corresponding reading passage, and the main challenge of the task is to distinguish answerable questions from unanswerable ones. Our system employs simple yet highly effective methods: it uses the pre-trained language model XLM-RoBERTa (XLM-R) and filters the results of multiple output files to produce the final result. We created about 5-7 output files and, for each question, selected the answer repeated most often across them as the final prediction. After filtering, our system's F1 score increased from 75.172% to 76.386%, and it achieved 65.329% on the EM measure on the Private Test set.
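
The output filtering step described above can be illustrated with a minimal sketch of majority voting. It assumes that each output file has been loaded as a dictionary mapping a question id to the predicted answer string, with an empty string marking an unanswerable question (the SQuAD2.0 prediction convention). The function name `vote_answers`, the sample data, and the tie-breaking rule (the answer seen first wins) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def vote_answers(runs: List[Dict[str, str]]) -> Dict[str, str]:
    """For each question id, keep the answer predicted by the most runs.

    `runs` holds one dictionary per output file (hypothetical format:
    question id -> answer string, "" meaning unanswerable).
    """
    final: Dict[str, str] = {}
    for qid in runs[0]:
        # Gather this question's answer from every run that contains it.
        candidates = [run[qid] for run in runs if qid in run]
        # most_common(1) yields the most frequent answer; on a tie,
        # the answer encountered first is returned.
        final[qid] = Counter(candidates).most_common(1)[0][0]
    return final


# Toy example with three runs instead of the 5-7 mentioned in the abstract.
runs = [
    {"q1": "Hà Nội", "q2": ""},
    {"q1": "Hà Nội", "q2": "1945"},
    {"q1": "thủ đô Hà Nội", "q2": ""},
]
voted = vote_answers(runs)
print(voted)
```

In this toy example, two of the three runs agree on each question, so the voted prediction keeps "Hà Nội" for `q1` and marks `q2` as unanswerable.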

Full Text