A clinical text classification paradigm using weak supervision and deep representation

Yanshan Wang, Sunghwan Sohn, Liwei Wang, Sijia Liu, Shreyasee Amin, Elizabeth J Atkinson, Feichen Shen, Hongfang Liu

doi:10.1186/s12911-018-0723-6

Open Access

https://doi.org/10.1186/s12911-018-0723-6

Journal: BMC Medical Informatics and Decision Making
Publication Date: Jan 7, 2019
Citations: 407
License type: open-access

Affiliation: Mayo Clinic

Abstract

Background: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce this human effort.

Methods: We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Since the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the paradigm's effectiveness on two institutional case studies at Mayo Clinic: smoking status classification and proximal femur (hip) fracture classification, and one case study using a public dataset: the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models, namely, Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN), using this paradigm. Precision, recall, and F1 score are used as metrics to evaluate performance.

Results: CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks.

Conclusion: The proposed clinical text classification paradigm could reduce the human effort of labeled training data creation and feature engineering for applying machine learning to clinical text classification by leveraging weak supervision and deep representation. The experiments have validated the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks.
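For reference, the three evaluation metrics named in the abstract have the following standard per-class definitions. The symbols TP, FP, and FN (true positives, false positives, false negatives) are notation introduced here, and this page does not state how the scores are averaged across classes.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```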

Highlights

Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives
The results imply that the Convolutional Neural Network (CNN) is able to capture hidden patterns from the weakly labeled training data that are not included in the rule-based NLP algorithms
We first developed a rule-based NLP algorithm to automatically generate labels for the training data, and used the pre-trained word embeddings as deep representation features to eliminate the need for task-specific feature engineering for training machine learning models
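The workflow described in the highlights above (rules generate labels, word embeddings supply features, a model is trained on the weak labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the keyword rules, the two-dimensional `TOY_EMBEDDINGS` table, and the example notes are all hypothetical, and a nearest-centroid classifier stands in for the SVM, RF, MLPNN, and CNN models that the paper actually evaluates.

```python
import re
from statistics import fmean

# Hypothetical keyword rules; the paper's actual rule set is not shown on this page.
SMOKER_PATTERN = re.compile(r"\b(smokes|smoker|tobacco use)\b", re.IGNORECASE)
NEGATION_PATTERN = re.compile(r"\b(denies|never|non-?smoker|no history of)\b", re.IGNORECASE)

# Hypothetical 2-d vectors standing in for pre-trained word embeddings.
DIM = 2
TOY_EMBEDDINGS = {
    "smokes": [1.0, 0.0], "smoker": [0.9, 0.1], "tobacco": [0.8, 0.2],
    "cigarettes": [0.9, 0.0], "pack": [0.7, 0.1],
    "denies": [0.0, 1.0], "never": [0.1, 0.9],
}

def weak_label(note):
    """Rule-based labeler: 1 = smoker, 0 = non-smoker (negation rules win)."""
    if NEGATION_PATTERN.search(note):
        return 0
    return 1 if SMOKER_PATTERN.search(note) else 0

def embed(note):
    """Average the embeddings of known tokens; zero vector if none are known."""
    tokens = re.findall(r"[a-z]+", note.lower())
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [fmean(column) for column in zip(*vectors)]

def train(notes):
    """Fit one centroid per weak label (stand-in for the paper's classifiers)."""
    by_label = {}
    for note in notes:
        by_label.setdefault(weak_label(note), []).append(embed(note))
    return {label: [fmean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}

def predict(centroids, note):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = embed(note)
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Unlabeled notes; labels come only from weak_label, never from a human annotator.
TRAINING_NOTES = [
    "Patient smokes one pack of cigarettes daily",
    "Current smoker with heavy tobacco use",
    "Patient denies tobacco use",
    "Never smoked and denies cigarettes",
]

if __name__ == "__main__":
    model = train(TRAINING_NOTES)
    note = "A pack of cigarettes daily"
    print("rule label:", weak_label(note), "model prediction:", predict(model, note))
```

In this toy setup, the note "A pack of cigarettes daily" matches none of the keyword rules, so the rule-based labeler assigns 0, while the model trained on the weak labels assigns 1 because the note's embedding lies near the smoker centroid. That mirrors, in miniature, the claim that a model can pick up patterns the rules themselves do not encode.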

Summary

Introduction

Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Large amounts of detailed longitudinal patient information, including lab tests, medications, disease status, and treatment outcomes, have been accumulated electronically and have become valuable data sources for clinical and translational research [2,3,4]. A well-known challenge faced when using electronic health record (EHR) data for research is that much of this detailed patient information is embedded in clinical text (e.g., clinical notes and progress reports). Clinical text classification, one of the popular NLP technologies, can unlock this information by extracting structured information (e.g., cancer stage information [5,6,7], disease characteristics [8,9,10], and pathological conditions [11]) from the narratives. Many successful clinical studies applying clinical text classification have been reported, including phenotyping algorithms [12, 13], detection of adverse events [14], improvement of healthcare quality [15, 16], and facilitation of genomics research [17,18,19,20].
