Abstract

Bio-BERT (BERT for Biomedical Text Mining) is a Natural Language Processing (NLP) model pre-trained on large-scale biomedical corpora. Bio-BERT is effective across a wide variety of NLP tasks applied to biomedical text. BERTSUM, BERTSUMABS, and BERTSUMEXTABS are NLP models built for Extractive Text Summarization (ETS) and Abstractive Text Summarization (ATS), and they have been evaluated on the CNN/DailyMail and Extreme Summarization datasets. In this chapter, the objective is to perform ETS and ATS on the CORD-19 dataset. A hybrid NLP model based on Bio-BERT, BERTSUM, BERTSUMABS, and BERTSUMEXTABS is proposed. Because the objective is summarization of biomedical text, Bio-BERT is chosen as it is pre-trained on biomedical PubMed full-text articles; BERTSUM, BERTSUMABS, and BERTSUMEXTABS are chosen because they are fine-tuned specifically for text summarization. With the rapid acceleration of novel COVID-19 publications, summaries of these publications are needed to save readers' time, and the model-generated summary should be on par with a human-written summary. Experiments were conducted on the CORD-19 dataset, and the proposed hybrid model was evaluated using the ROUGE metric. Compared with the BERT-based BERTSUM, BERTSUMABS, and BERTSUMEXTABS on the CORD-19 dataset, the proposed model achieves the highest ROUGE scores for both ETS and ATS.
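The sketch below is not the chapter's implementation; it only illustrates the two ingredients the abstract names: encoding biomedical sentences with a pre-trained Bio-BERT checkpoint, and scoring a generated summary against a human-written reference with ROUGE. The checkpoint name `dmis-lab/biobert-base-cased-v1.1`, the cosine-similarity sentence ranking (a stand-in for a trained BERTSUM-style classifier), and the helper names are assumptions made for illustration.

```python
# Minimal sketch, assuming the transformers, torch, and rouge-score packages.
# The checkpoint name and the naive ranking heuristic are assumptions; the
# chapter's hybrid model fine-tunes BERTSUM-style heads instead.
import torch
from transformers import AutoTokenizer, AutoModel
from rouge_score import rouge_scorer

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed Bio-BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

def embed(texts):
    """Return one [CLS] vector per input text from the Bio-BERT encoder."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :]  # [CLS] embeddings

def extractive_summary(sentences, k=3):
    """Pick the k sentences closest to the mean document embedding
    (an illustrative stand-in for a trained extractive classifier)."""
    sent_vecs = embed(sentences)
    doc_vec = sent_vecs.mean(dim=0, keepdim=True)
    scores = torch.nn.functional.cosine_similarity(sent_vecs, doc_vec)
    top = scores.topk(min(k, len(sentences))).indices.sort().values
    return " ".join(sentences[int(i)] for i in top)

# ROUGE evaluation of a model summary against a human-written reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

def rouge_f1(reference, candidate):
    """Return ROUGE-1/2/L F1 scores for a candidate summary."""
    return {name: s.fmeasure
            for name, s in scorer.score(reference, candidate).items()}
```

For example, `rouge_f1(human_summary, extractive_summary(paper_sentences))` would give the ROUGE-1, ROUGE-2, and ROUGE-L F1 values of the kind reported for comparing model output with reference summaries.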
