Abstract

The SARS-CoV-2 (COVID-19) pandemic spotlighted the importance of moving quickly with biomedical research. However, as the number of biomedical research papers continues to increase, the task of finding relevant articles to answer pressing questions has become a significant challenge. In this work, we propose a textual data mining tool that supports literature search to accelerate the work of researchers in the biomedical domain. We achieve this by building a neural-based deep contextual understanding model for Question-Answering (QA) and Information Retrieval (IR) tasks. We pre-train our BioMedBERT model on the new BREATHE dataset, one of the largest available datasets of biomedical research literature, which contains abstracts and full-text articles from ten different biomedical literature sources. Our work achieves state-of-the-art results on the QA fine-tuning task on the BioASQ 5b, 6b and 7b datasets. In addition, we observe more relevant results when BioMedBERT embeddings are used with Elasticsearch for the Information Retrieval task on the intelligently formulated BioASQ dataset. We believe our diverse dataset and our unique model architecture are what led us to achieve state-of-the-art results on the QA and IR tasks.

Highlights

  • The COVID-19 pandemic reminded us of the need for a tool that helps biomedical researchers sift through existing research, extract novel insights, and make novel drug discoveries

  • BioMedBERT may be viewed as setting a new state of the art on biomedical question-answering tasks

  • We present the BioMedBERT model pre-trained on the BREATHE v1.0 dataset, one of the largest and most diverse datasets of biomedical research literature


Summary

Introduction

The COVID-19 pandemic reminded us of the need for a tool that helps biomedical researchers sift through existing research, extract novel insights, and make novel drug discoveries. PubMed reports that more than 1 million biomedical research papers are published each year, amounting to nearly two papers per minute (Landhuis, 2016). As of June 2020, more than 8,000 peer-reviewed publications mentioning COVID-19 alone had been published on PubMed. With the number of scientific papers on COVID-19 doubling every fourteen days (Coren, 2020), it is imperative to have a language understanding tool that can extract relevant information from credible literature, such as the research methodology, data, authors, results, and citations (Hao, 2020). We address the problem from an information retrieval perspective, extracting the textual and contextual information from the corpus by taking a hierarchical approach. Traditional search approaches such as Lucene-based Elasticsearch (Gormley and Tong, 2015) using BM25 and Jaccard-based metrics are efficient in retrieving objective answers where the primary task is to extract specific parts of
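To make the lexical scoring mentioned above concrete, the following is a minimal sketch of BM25 and Jaccard matching in plain Python. It is not the authors' pipeline and not Elasticsearch's implementation: the whitespace tokenizer, the function names, the three-sentence corpus, and the query are all assumptions invented for illustration, and the `k1` and `b` values are simply commonly used defaults.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def jaccard(query, doc):
    """Jaccard similarity: shared unique terms over all unique terms."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Lucene-style IDF variant, which stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is length-normalized by b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "remdesivir inhibits viral rna polymerase",
    "masks reduce transmission of respiratory viruses",
    "antiviral drugs block replication of the virus",
]
query = "antiviral drugs"

print(bm25_scores(query, docs))
print([jaccard(query, d) for d in docs])
```

In this toy corpus only the third sentence shares terms with the query, so it is the only one with a non-zero score under either measure. The first sentence describes an antiviral drug but scores zero because it contains neither query word, which is the kind of gap that contextual embeddings are intended to narrow.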
