Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets

Shikhar Vashishth, Denis Newman-Griffis, Rishabh Joshi, Ritam Dutt, Carolyn P Rosé

doi:10.1016/j.jbi.2021.103880


Open Access


Abstract

Objectives: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction, that is, extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically-constructed, large-scale datasets with broad coverage of semantic types.

Methods: We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking.

Results: Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F-1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text.

Conclusions: Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research.
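The filtering step described in the Methods can be illustrated with a minimal sketch. This is not the MedType implementation: the `Candidate` class, the function names, the concept identifiers, the scores, and the fallback to the unfiltered list when no candidate matches are all assumptions made for illustration. The sketch only shows how a predicted semantic type could prune an overgenerated candidate list before the best concept is picked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    concept_id: str            # hypothetical identifier, not a real UMLS CUI
    semantic_types: frozenset  # semantic type labels attached to the concept
    score: float               # score assigned by the toolkit's candidate generator


def filter_by_type(candidates, predicted_types):
    """Keep candidates that share at least one semantic type with the prediction."""
    kept = [c for c in candidates if c.semantic_types & predicted_types]
    # Assumption: if the filter removes every candidate, fall back to the
    # unfiltered list so the linker can still return something.
    return kept or list(candidates)


def link(candidates, predicted_types):
    """Pick the highest-scoring candidate after semantic type filtering."""
    pool = filter_by_type(candidates, predicted_types)
    return max(pool, key=lambda c: c.score) if pool else None


# Hypothetical candidates for the ambiguous mention "cold".
candidates = [
    Candidate("C_COLD_TEMPERATURE", frozenset({"Natural Phenomenon or Process"}), 0.55),
    Candidate("C_COMMON_COLD", frozenset({"Disease or Syndrome"}), 0.45),
]

# Without filtering, the higher-scoring temperature sense would win; with a
# predicted type of "Disease or Syndrome", only the disease sense remains.
best = link(candidates, {"Disease or Syndrome"})
print(best.concept_id)
```

In this toy setting the type predictor is treated as a given input; in the paper it is a separate learned module that sits between candidate generation and concept selection.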

Highlights

Biomedical natural language processing (NLP) tools are increasingly being applied for a wide variety of purposes, from clinical research [1] to quality improvement [2].
Several well-known biomedical NLP tools have been developed as standalone software packages and are regularly used for broad-coverage extraction in non-NLP research: for example, cTAKES [3] has been explored for ischemic stroke classification [4] and studying infection risk [5]; and MetaMap [6] is frequently used in pharmacovigilance [7] and has even been adapted to health outcomes study in social media [8].
Broad-coverage information extraction from biomedical text is an important application area for biomedical NLP tools, and one which poses significant challenges in the scale and diversity of information to extract. To help address these challenges, we introduced semantic type prediction as a modular component of biomedical information extraction pipelines, and presented MedType, a state-of-the-art neural model for semantic type prediction.

Summary

Introduction

Biomedical natural language processing (NLP) tools are increasingly being applied for a wide variety of purposes, from clinical research [1] to quality improvement [2]. One of the central challenges in broad-coverage information extraction is the diversity of concepts in the standardized vocabularies that form the backbone of biomedical text analysis [9]. The Metathesaurus of the Unified Medical Language System, or UMLS [10], contains over 3.5 million unique concepts belonging to 127 different semantic types. While much of the prior research on biomedical NLP methods has focused on restricted subsets of concepts, such as diseases and disorders or genes and proteins [11], general-purpose tools built for arbitrary use must deal with the full breadth of concept types in reference vocabularies.



Journal: Journal of Biomedical Informatics
Publication Date: Aug 12, 2021
Citations: 20
License type: CC BY
