Medical named entity recognition (NER) in Chinese electronic medical records (CEMRs) has drawn much research attention and is a vital prerequisite for extracting high-value medical information. In 2018, the China Health Information Processing Conference (CHIP2018) organized a medical NER academic competition aiming to extract three types of malignant tumor entities from CEMRs. Since the three entity types are highly domain-specific and interdependent, they cannot be extracted with a single neural network model. Based on a comprehensive study of the three entity types and their interdependencies, we propose an approach based on the collaboration of multiple neural network models, consisting of two BiLSTM-CRF models and a CNN model. To tackle the problem that the target-scene dataset is small and its entity distributions are sparse, we introduce non-target-scene datasets and propose sentence-level transfer learning for neural network models. Using 30,000 real-world CEMRs, we pre-train medical domain-specific Chinese character embeddings with word2vec, GloVe and ELMo, and apply each in turn to our approach to evaluate the effect of pre-trained language models on Chinese medical NER. As a control experiment, we also apply the Gated Recurrent Unit (GRU) to our approach. Our approach achieves an overall F1-score of 87.60%, which is, to the best of our knowledge, state-of-the-art performance. In addition, our approach won first place in the medical NER academic competition organized by the 2019 China Conference on Knowledge Graph and Semantic Computing, which demonstrates its strong generalization ability.