Chinese Word Segmentation in Electronic Medical Record Text via Graph Neural Network-Bidirectional LSTM-CRF Model

Jinlian Du, Wei Mi, Xiaolin Du

doi:10.1109/bibm49941.2020.9313165

Abstract

Word segmentation of electronic medical record (EMR) text is the foundation of natural language processing in medicine. Because EMR text is highly specialized, costly to annotate, written in a distinctive style, and subject to continually growing terminology, current Chinese word segmentation (CWS) methods cannot fully meet the requirements of EMR applications. To address this problem, this paper designs an EMR word segmentation model based on a Graph Neural Network (GNN), a bidirectional Long Short-Term Memory network (Bi-LSTM), and a conditional random field (CRF), with the aim of improving segmentation performance and reducing dependence on the data set. In the model, a GNN built on a domain lexicon learns local composition features, the Bi-LSTM captures long-term dependencies and contextual sequence information, and the CRF obtains the optimal annotation sequence from sentence-level label information. Through the interaction of these features, the model effectively resolves ambiguity and recognizes new words in EMR word segmentation. Compared with CWS tools such as Jieba and Pkuseg, as well as baseline models and state-of-the-art methods, the proposed model achieves significantly higher precision and recall.
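The sentence-level decoding step attributed to the CRF can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a BMES tagging scheme (Begin, Middle, End, Single), which the abstract does not specify; it uses hand-picked, hypothetical per-character scores in place of Bi-LSTM outputs; and it replaces learned CRF transition scores with hard validity constraints. It shows only why decoding the whole sentence can differ from choosing the best tag for each character independently.

```python
import math

TAGS = ("B", "M", "E", "S")  # assumed BMES scheme, not stated in the abstract

# Hard constraints standing in for learned CRF transition scores (assumption):
# ALLOWED[prev] is the set of tags that may follow prev.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}
START_OK = ("B", "S")  # a sentence must start a word
END_OK = ("E", "S")    # a sentence must end a word


def viterbi(emissions):
    """Return the highest-scoring valid tag sequence.

    emissions: one dict per character, mapping each tag to a score.
    """
    n = len(emissions)
    score = [{t: -math.inf for t in TAGS} for _ in range(n)]
    back = [{t: None for t in TAGS} for _ in range(n)]
    for t in START_OK:
        score[0][t] = emissions[0][t]
    for i in range(1, n):
        for cur in TAGS:
            for prev in TAGS:
                if cur not in ALLOWED[prev]:
                    continue
                s = score[i - 1][prev] + emissions[i][cur]
                if s > score[i][cur]:
                    score[i][cur] = s
                    back[i][cur] = prev
    # Pick the best valid final tag, then follow the back-pointers.
    tags = [max(END_OK, key=lambda t: score[n - 1][t])]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    tags.reverse()
    return tags


def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hypothetical scores for the four characters of "患者头痛" (patient, headache).
sentence = "患者头痛"
emissions = [
    {"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
    {"B": 0.5, "M": 0.0, "E": 1.5, "S": 1.0},
    {"B": 1.0, "M": 0.2, "E": 0.0, "S": 1.5},  # locally prefers S
    {"B": 0.0, "M": 0.0, "E": 2.0, "S": 0.5},
]

best_tags = viterbi(emissions)
print(best_tags)                           # ['B', 'E', 'B', 'E']
print(tags_to_words(sentence, best_tags))  # ['患者', '头痛']
```

In this toy input the third character scores highest as S on its own, which would produce the invalid sequence B, E, S, E under per-character selection. Decoding over the whole sentence selects B, E, B, E instead.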

Full Text