Abstract

In recent years, various neural network architectures have been successfully applied to natural language processing (NLP) tasks such as named entity normalization. Named entity normalization is a fundamental task for extracting information from free text; it aims to map entity mentions in a text to gold-standard entities in a given domain-specific ontology. However, normalization in the biomedical domain remains challenging because of multiple synonyms, various acronyms, and numerous lexical variations. In this study, we regard biomedical entity normalization as a ranking problem and propose an approach to rank normalized concepts. We additionally employ two factors that can notably affect normalization performance: task-specific pre-training (Task-PT) and a calibration approach. Across five different biomedical benchmark corpora, our experimental results show that the proposed model achieved significant improvements over previous methods and advanced the state-of-the-art performance for biomedical entity normalization, with up to a 0.5% increase in accuracy and a 1.2% increase in F-score.

Highlights

  • With the rapid development of computational technology, a large amount of literature has accumulated on various aspects regardless of domain

  • The main contributions of our proposed study are as follows: (i) We demonstrate the effectiveness of word representations from pre-trained language models (LMs) over context-independent representations; (ii) We further pre-train the LMs on task-specific sentences for the ranking task in biomedical normalization; (iii) We show that our models employing the calibration method achieve significant improvements in normalization performance; and (iv) We show that a simple but effective strategy of combining two different scoring systems is a key factor in the performance improvement of our models

  • We evaluated our normalization approach on the English biomedical benchmark corpora described in Table 1: the National Center for Biotechnology Information disease (NCBI) corpus [50], the BioCreative V Chemicals Disease Relationship (CDR) corpus [16], the BioCreative II Gene Normalization (GN) corpus [14], and the plant (Plant) corpus [6]


Summary

Introduction

With the rapid development of computational technology, a large amount of literature has accumulated across many domains. Building on this large amount of text data, many researchers have considered constructing multiple knowledge bases (KBs) of domain-specific ontologies. Such knowledge bases are useful in many applications, from the general domain to specialized domains such as biomedicine, and are beneficial for extracting key information related to entities of interest [1]. The entity normalization task in the biomedical domain is necessary to resolve semantic ambiguity, as each biomedical entity may be written in numerous forms [5]. Although many researchers have addressed ambiguity resolution to avoid these difficulties, the normalization task in the biomedical domain is still challenging because of multiple synonyms, various acronyms, and numerous lexical variations [6]. For example, ‘AS’ can be expanded to different terms after abbreviation resolution, such as ‘Angelman Syndrome (MeSH:D017204)’ or ‘Ammonium Sulfate (MeSH:D000645)’.
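
The ranking view of normalization can be illustrated with a minimal Python sketch built on the ‘AS’ example. This is not the authors' model: the two scoring functions (a string-similarity score and a context-overlap score), the `weight` parameter, and all function names are assumptions made for illustration. Only the two MeSH entries come from the text above.

```python
from difflib import SequenceMatcher

# Toy candidate set; the two MeSH entries are the ones named in the text.
CANDIDATES = {
    "MeSH:D017204": "Angelman Syndrome",
    "MeSH:D000645": "Ammonium Sulfate",
}


def lexical_score(mention, name):
    """String similarity in [0, 1]; a hypothetical stand-in for a learned score."""
    return SequenceMatcher(None, mention.lower(), name.lower()).ratio()


def context_score(context, name):
    """Fraction of the concept name's words that occur in the context (hypothetical)."""
    context_words = set(context.lower().split())
    name_words = name.lower().split()
    return sum(word in context_words for word in name_words) / len(name_words)


def rank_concepts(mention, context, candidates=CANDIDATES, weight=0.5):
    """Return (concept_id, score) pairs sorted from best to worst.

    The final score is a weighted sum of the two scores above; the
    weighting scheme is an assumption, not the method from the paper.
    """
    scored = []
    for concept_id, name in candidates.items():
        score = (weight * lexical_score(mention, name)
                 + (1 - weight) * context_score(context, name))
        scored.append((concept_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sentence = "patients with Angelman syndrome (AS) showed seizures"
    for concept_id, score in rank_concepts("AS", sentence):
        print(concept_id, round(score, 3))
```

In this toy setting the mention ‘AS’ alone is nearly equally similar to both concept names, so the surrounding context decides which concept is ranked first.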
