Abstract

Objectives: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction: extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types.

Methods: We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking.

Results: Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F-1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text.

Conclusions: Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text.
We make our source code and novel datasets publicly available to foster reproducible research.
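
The filtering step described in the Methods can be illustrated with a minimal sketch. This is not the MedType implementation: the `Candidate` class, the type labels, the placeholder identifiers, and the fallback to the unfiltered list when no candidate matches the predicted type are all assumptions made for illustration. The sketch only shows where a predicted semantic type would sit between candidate generation and final concept selection.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One candidate concept returned by a toolkit's candidate generator (hypothetical shape)."""
    concept_id: str      # placeholder identifier, not a real vocabulary ID
    name: str
    semantic_type: str   # illustrative coarse type label
    score: float         # candidate generator's own ranking score


def filter_by_type(candidates: List[Candidate], predicted_type: str) -> List[Candidate]:
    """Keep only candidates whose type matches the predicted type.

    Assumption: if the filter would remove every candidate, fall back to the
    unfiltered list so a wrong type prediction cannot empty the pool.
    """
    kept = [c for c in candidates if c.semantic_type == predicted_type]
    return kept if kept else list(candidates)


def link(candidates: List[Candidate], predicted_type: str) -> Optional[Candidate]:
    """Pick the highest-scoring candidate after semantic type filtering."""
    pool = filter_by_type(candidates, predicted_type)
    return max(pool, key=lambda c: c.score) if pool else None


# Toy example: the mention "cold" with two overgenerated candidates.
candidates = [
    Candidate("ID_A", "Cold Temperature", "Phenomenon", 0.9),
    Candidate("ID_B", "Common Cold", "Disorder", 0.8),
]
best = link(candidates, predicted_type="Disorder")
print(best.name if best else None)
```

In this toy case the unfiltered ranking would prefer the higher-scoring "Cold Temperature" candidate, while the type filter restricts the choice to the disorder reading. In a real pipeline the predicted type would come from a trained model rather than being passed in by hand.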

Highlights

  • Biomedical natural language processing (NLP) tools are increasingly being applied for a wide variety of purposes, from clinical research [1] to quality improvement [2]

  • Several well-known biomedical NLP tools have been developed as standalone software packages and are regularly used for broad-coverage extraction in non-NLP research: for example, cTAKES [3] has been explored for ischemic stroke classification [4] and studying infection risk [5]; and MetaMap [6] is frequently used in pharmacovigilance [7] and has even been adapted to health outcomes study in social media [8]

  • Broad-coverage information extraction from biomedical text is an important application area for biomedical NLP tools, and one which poses significant challenges in the scale and diversity of information to extract. To help address these challenges, we introduced semantic type prediction as a modular component of biomedical information extraction pipelines, and presented MedType, a state-of-the-art neural model for semantic type prediction

Summary

Introduction

Biomedical natural language processing (NLP) tools are increasingly being applied for a wide variety of purposes, from clinical research [1] to quality improvement [2]. One of the central challenges in broad-coverage information extraction is the diversity of concepts in the standardized vocabularies that form the backbone of biomedical text analysis [9]. The Unified Medical Language System (UMLS) [10] Metathesaurus contains over 3.5 million unique concepts belonging to 127 different semantic types. While much of the prior research on biomedical NLP methods has focused on restricted subsets of concepts, such as diseases and disorders or genes and proteins [11], general-purpose tools built for arbitrary use must deal with the full breadth of concept types in reference vocabularies.
