Chinese Clinical Named Entity Recognition in Electronic Medical Records: Development of a Lattice Long Short-Term Memory Model With Contextualized Character Representations.

Yongbin Li, Luo Xu, Liping Zou, Hongjin Li, Linhu Hui, Xiaohua Wang, Weihai Liu

doi:10.2196/19848

Abstract

Background: Clinical named entity recognition (CNER), whose goal is to automatically identify clinical entities in electronic medical records (EMRs), is an important research direction of clinical text data mining and information extraction. The promotion of CNER can provide support for clinical decision making and medical knowledge base construction, which could then improve overall medical quality. Compared with English CNER, and due to the complexity of Chinese word segmentation and grammar, Chinese CNER was implemented later and is more challenging.

Objective: With the development of distributed representation and deep learning, a series of models have been applied in Chinese CNER. Different from the English version, Chinese CNER is mainly divided into character-based and word-based methods that cannot make comprehensive use of EMR information and cannot solve the problem of ambiguity in word representation.

Methods: In this paper, we propose a lattice long short-term memory (LSTM) model combined with a variant contextualized character representation and a conditional random field (CRF) layer for Chinese CNER: the Embeddings from Language Models (ELMo)-lattice-LSTM-CRF model. The lattice LSTM model can effectively utilize the information from characters and words in Chinese EMRs; in addition, the variant ELMo model uses Chinese characters as input instead of the character-encoding layer of the ELMo model, so as to learn domain-specific contextualized character embeddings.

Results: We evaluated our method using two Chinese CNER datasets from the China Conference on Knowledge Graph and Semantic Computing (CCKS): the CCKS-2017 CNER dataset and the CCKS-2019 CNER dataset. We obtained F1 scores of 90.13% and 85.02% on the test sets of these two datasets, respectively.

Conclusions: Our results show that our proposed method is effective in Chinese CNER. In addition, the results of our experiments show that variant contextualized character representations can significantly improve the performance of the model.
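To make concrete how a lattice model can draw on both characters and words, the following is a minimal sketch of the lexicon-matching step that typically precedes a lattice LSTM: every multi-character word from a lexicon that occurs in the character sequence becomes an extra path between its first and last character. The function name `match_lexicon_words`, the toy lexicon, and the example sentence are illustrative assumptions, not taken from the paper, and the sketch does not include the LSTM, ELMo, or CRF components.

```python
from typing import List, Set, Tuple


def match_lexicon_words(
    chars: str, lexicon: Set[str], max_len: int = 5
) -> List[Tuple[int, int, str]]:
    """Return (start, end_inclusive, word) for each lexicon word found in chars.

    Only words of at least two characters are considered, since single
    characters are already covered by the character sequence itself.
    """
    spans = []
    for start in range(len(chars)):
        # Try every candidate length from 2 up to max_len characters.
        last_end = min(start + max_len, len(chars))
        for end in range(start + 2, last_end + 1):
            word = chars[start:end]
            if word in lexicon:
                spans.append((start, end - 1, word))
    return spans


# Hypothetical toy lexicon and sentence, for illustration only.
lexicon = {"胃癌", "根治", "根治术", "胃癌根治术", "术后"}
sentence = "胃癌根治术后"

spans = match_lexicon_words(sentence, lexicon)
for start, end, word in spans:
    print(start, end, word)
```

In a lattice LSTM, each matched span would contribute a word-level cell whose state is merged into the character cell at the span's end position, so overlapping candidates such as 根治 and 根治术 can coexist without committing to a single segmentation.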

Highlights

Electronic medical records (EMRs) are an important data resource to describe patients’ disease conditions or treatment processes
We divided the dataset into two parts: 1198 electronic medical records (EMRs) were taken as the training set and 398 EMRs were taken as the test set
We observed that the Embeddings from Language Models (ELMo)-lattice-long short-term memory (LSTM)-conditional random field (CRF) model we proposed, which integrates the lattice LSTM structure and variant pretrained ELMo embeddings, achieved excellent results compared with the other models on both Chinese clinical named entity recognition (CNER) datasets
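As a rough illustration of how an entity-level F1 score such as those reported in the abstract can be computed, the sketch below counts a predicted entity as correct only when its start, end, and type all match a gold entity. This strict matching criterion, the function name `entity_f1`, and the example entities and type labels are assumptions made for illustration; the source does not state the exact evaluation protocol.

```python
from typing import Set, Tuple

# An entity is (start_index, end_index, type); the type names below are hypothetical.
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict span-and-type matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two of four predictions match the three gold entities exactly.
gold = {(0, 4, "operation"), (6, 7, "symptom"), (10, 12, "drug")}
pred = {(0, 4, "operation"), (6, 7, "drug"), (10, 12, "drug"), (14, 15, "symptom")}

precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```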

Summary

Introduction

Electronic medical records (EMRs) are an important data resource for describing patients’ disease conditions and treatment processes. Clinical named entity recognition (CNER), whose goal is to automatically identify clinical entities in EMRs, is an important research direction of clinical text data mining and information extraction. It is a key component of clinical text mining and EMR information extraction research and is used for clinical decision support in medical informatics [3]. To promote information extraction from Chinese EMRs, Chinese CNER was introduced as a task three times at the China Conference on Knowledge Graph and Semantic Computing (CCKS), from 2017 to 2019. In this paper, we conducted research and experiments with our Chinese CNER approach, based on the CCKS-2017 (Task 2) CNER dataset and the CCKS-2019 (Task 1) CNER dataset. The results of our experiments show that variant contextualized character representations can significantly improve the performance of the model.

Objectives

Methods

Results

Discussion

Conclusion