Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes.

Chin Lin,Chia-Jung Hsu,Shih-Jen Yeh,Chia-Cheng Lee,Hsiang-Cheng Chen,Sui-Lung Su,Yu-Sheng Lou

doi:10.2196/jmir.8344

Abstract

BackgroundAutomated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN).ObjectiveOur objective was to compare the performance of traditional pipelines (NLP plus supervised machine learning models) with that of word embedding combined with a CNN in conducting a classification task identifying International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes in discharge notes.MethodsWe used 2 classification methods: (1) extracting from discharge notes some features (terms, n-gram phrases, and SNOMED CT categories) that we used to train a set of supervised machine learning models (support vector machine, random forests, and gradient boosting machine), and (2) building a feature matrix, by a pretrained word embedding model, that we used to train a CNN. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes in a set of discharge notes. We conducted the evaluation using 103,390 discharge notes covering patients hospitalized from June 1, 2015 to January 31, 2017 in the Tri-Service General Hospital in Taipei, Taiwan. We used the receiver operating characteristic curve as an evaluation measure, and calculated the area under the curve (AUC) and F-measure as the global measure of effectiveness.ResultsIn 5-fold cross-validation tests, our method had a higher testing accuracy (mean AUC 0.9696; mean F-measure 0.9086) than traditional NLP-based approaches (mean AUC range 0.8183-0.9571; mean F-measure range 0.5050-0.8739). A real-world simulation that split the training sample and the testing sample by date verified this result (mean AUC 0.9645; mean F-measure 0.9003 using the proposed method). Further analysis showed that the convolutional layers of the CNN effectively identified a large number of keywords and automatically extracted enough concepts to predict the diagnosis codes.ConclusionsWord embedding combined with a CNN showed outstanding performance compared with traditional methods, needing very little data preprocessing. This shows that future studies will not be limited by incomplete dictionaries. A large amount of unstructured information from free-text medical writing will be extracted by automated approaches in the future, and we believe that the health care field is about to enter the age of big data.

Highlights

Public health surveillance systems are important for identifying unusual events of public health importance and will provide information for public health action [1]
Word embedding combined with a convolutional neural network (CNN) showed outstanding performance compared with traditional methods, needing very little data preprocessing
CNNs with various structures have been developed, we focused on a 1-layer CNN with a filter region size of 1-5 to increase comparability with traditional machine learning technologies

Summary

Introduction

Public health surveillance systems are important for identifying unusual events of public health importance and will provide information for public health action [1]. Most surveillance systems can only use structured data, such as International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes. The current methods for collecting this structured information usually involve manual identification, but manual identification of disease codes from free-text clinical narratives is laborious and costly. Most surveillance systems do not have enough expert clinical coders for real-time surveillance, and this leads to delays in the release of disease statistics. A timely and computer-based disease classification approach is required to further assist public health action. Automated disease code classification using free-text medical information is important for public health surveillance. Traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN)

Methods

Results

Discussion

Conclusion