Precursor-induced conditional random fields: connecting separate entities by induction for improved clinical named entity recognition

Wangjin Lee,Jinwook Choi

doi:10.1186/s12911-019-0865-1

Wangjin Lee, Jinwook Choi

Open Access

https://doi.org/10.1186/s12911-019-0865-1

Copy DOI

Abstract

BackgroundThis paper presents a conditional random fields (CRF) method that enables the capture of specific high-order label transition factors to improve clinical named entity recognition performance. Consecutive clinical entities in a sentence are usually separated from each other, and the textual descriptions in clinical narrative documents frequently indicate causal or posterior relationships that can be used to facilitate clinical named entity recognition. However, the CRF that is generally used for named entity recognition is a first-order model that constrains label transition dependency of adjoining labels under the Markov assumption.MethodsBased on the first-order structure, our proposed model utilizes non-entity tokens between separated entities as an information transmission medium by applying a label induction method. The model is referred to as precursor-induced CRF because its non-entity state memorizes precursor entity information, and the model’s structure allows the precursor entity information to propagate forward through the label sequence.ResultsWe compared the proposed model with both first- and second-order CRFs in terms of their F1-scores, using two clinical named entity recognition corpora (the i2b2 2012 challenge and the Seoul National University Hospital electronic health record). The proposed model demonstrated better entity recognition performance than both the first- and second-order CRFs and was also more efficient than the higher-order model.ConclusionThe proposed precursor-induced CRF which uses non-entity labels as label transition information improves entity recognition F1 score by exploiting long-distance transition factors without exponentially increasing the computational time. In contrast, a conventional second-order CRF model that uses longer distance transition factors showed even worse results than the first-order model and required the longest computation time. Thus, the proposed model could offer a considerable performance improvement over current clinical named entity recognition methods based on the CRF models.

Highlights

This paper presents a conditional random fields (CRF) method that enables the capture of specific high-order label transition factors to improve clinical named entity recognition performance
This study focuses on using the interdependency of Named Entity (NE) separated by an arbitrary number of non-entity tokens, a condition that is predominant in clinical texts but rarely captured by first-order CRF models
Dataset description All the experiments were performed on the named entity recognition (NER) sets in clinical and general domains: English clinical texts (i2b2 2012 natural language processing (NLP) shared task data [3]), rheumatism patients’ discharge summaries obtained from Seoul National University Hospital (SNUH) [40], and the Conference on Natural Language Learning (CoNLL)-2003 NER shared task corpus [41]

Summary

Introduction

This paper presents a conditional random fields (CRF) method that enables the capture of specific high-order label transition factors to improve clinical named entity recognition performance. The CRF that is generally used for named entity recognition is a first-order model that constrains label transition dependency of adjoining labels under the Markov assumption. With the recent application of artificial intelligence to the medical field, health information systems are expected to handle medical data in the form of unstructured text. The unstructured clinical text conveys descriptions of patients’ health information, including their histories of illness and hospital treatment. Salient concepts that express a patient’s health status are represented by named entities (NEs) in the text. The identifying textual mentions of health-related concepts, termed clinical named entity recognition (NER), is a sub-problem in the field of clinical natural language processing (NLP) [1]. Heterogeneous classes of clinical entities have been employed in recent studies; these are strongly related to clinical activities, such as medical examination, medication, and diagnosis [2,3,4,5,6,7]

Methods

Results

Discussion

Conclusion