Abstract

Motivation: Despite the central role of diseases in biomedical research, there have been much fewer attempts to automatically determine which diseases are mentioned in a text—the task of disease name normalization (DNorm)—compared with other normalization tasks in biomedical text mining research.Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTatorContact: zhiyong.lu@nih.gov

Highlights

  • Diseases are central to many lines of biomedical research, and enabling access to disease information is the goal of many information extraction and text mining efforts (Islamaj Dogan and Lu, 2012b; Kang et al, 2012; Neveol et al, 2012; Wiegers et al, 2012)

  • The majority of the errors made by BANNER þ cosine similarity but not by disease name normalization (DNorm) are due to term variation

  • We have shown that pairwise learning to rank (pLTR) successfully learns a mapping from disease name mentions to disease concept names, resulting in a significant improvement in normalization performance

Read more

Summary

Introduction

Diseases are central to many lines of biomedical research, and enabling access to disease information is the goal of many information extraction and text mining efforts (Islamaj Dogan and Lu, 2012b; Kang et al, 2012; Neveol et al, 2012; Wiegers et al, 2012). The task of disease normalization consists of finding disease mentions and assigning a unique identifier to each. This task is important in many lines of inquiry involving disease, including etiology (e.g. gene–disease relationships) and clinical aspects (e.g. diagnosis, prevention and treatment). Given the wide range of concepts that may be categorized as diseases—their respective etiologies, clinical presentations and their various histories of diagnosis and treatment—disease names naturally exhibit considerable variation. This variation presents in synonymous terms for the same disease, and in the diverse logic used to create the disease names themselves

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.