Long multispan prediction model for machine reading comprehension in healthcare domain

Youngjin Jang, Hyeon-Gu Lee, Harksoo Kim

doi:10.1016/j.eswa.2022.119300

Abstract

Machine reading comprehension (MRC) is a question answering task in which a system answers users' queries based on a given document. With large-scale language models and sufficient training data, recent MRC models have surpassed humans on well-designed intrinsic tests that require short, single-span answers. However, they perform poorly in real-world applications that require long, multispan answers. In the healthcare domain, users want long and detailed information (e.g., symptoms of an illness, causes of a disease, and effects of a drug) rather than short and simple facts (e.g., the name of an illness, the name of a virus, or a date of discovery). To satisfy these needs, we propose an MRC model that extracts nonconsecutive long text spans from a document. The proposed model detects long candidate answer spans consisting of sentences and determines multiple nonconsecutive spans by using a span matrix. In an experiment on MASHQA, a long multispan dataset in the healthcare domain, the proposed model outperformed previous state-of-the-art MRC models on all evaluation metrics.
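
To make the idea of selecting several nonconsecutive spans from a span matrix concrete, the following is a minimal sketch and not the model described in this paper. It assumes a hypothetical sentence-level matrix `span_scores` in which entry (i, j), with i <= j, is a score for the span running from sentence i through sentence j, and it assumes a simple greedy, threshold-based selection. The function name `select_spans`, the threshold value, and the example scores are all illustrative.

```python
from typing import List, Tuple


def select_spans(span_scores: List[List[float]],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Greedily pick non-overlapping sentence spans from a span score matrix.

    Assumption (illustrative only): span_scores[i][j] scores the span that
    starts at sentence i and ends at sentence j, and only i <= j is used.
    """
    n = len(span_scores)
    # Collect every span in the upper triangle whose score passes the threshold.
    candidates = [
        (span_scores[i][j], i, j)
        for i in range(n)
        for j in range(i, n)
        if span_scores[i][j] >= threshold
    ]
    # Highest score first; ties are broken by earlier start, then earlier end.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    chosen: List[Tuple[int, int]] = []
    for _, start, end in candidates:
        # Keep a span only if it does not overlap any span already chosen.
        if all(end < s or start > e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


# Hypothetical scores for a five-sentence document.
scores = [
    [0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.3, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.4],
]
print(select_spans(scores))  # [(0, 1), (3, 3)]
```

In this toy example, sentences 0 to 1 and sentence 3 are returned as two separate answer spans, while the overlapping lower-scored span covering sentences 3 to 4 is discarded.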

Full Text