Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study.

Tjardo D Maarseveen, Rachel Knevel, Arnd Kleyer, Timo Meinderink, Marcel J T Reinders, Tom W J Huizinga, David Simon, Johannes Knitza, Erik B Van Den Akker

doi:10.2196/23930

Abstract

Background: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries.

Objective: The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records.

Methods: Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation.

Results: For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). Applying this support vector machine classifier to the test data resulted in a similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best (AUROC 0.94; AUPRC 0.85; F1 score 0.82) in the training set, and applied to the test data, resulted once again in good results (F1 score 0.67; PPV 0.97).

Conclusions: We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries for low cost from already available data in electronic health record systems.
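The abstract reports PPV and F1 score and mentions choosing a threshold on the classifier output with PPV in mind. The sketch below shows how these quantities relate for a set of predicted probabilities. The toy scores and labels are invented for illustration, and the selection rule used here (take the lowest threshold whose PPV reaches a target) is an assumption, since the abstract does not spell out the exact rule the authors applied.

```python
from typing import List, Optional, Tuple

# Toy data, invented for illustration only (not from the study).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 = case, 0 = non-case


def confusion_counts(scores: List[float], labels: List[int],
                     threshold: float) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_case = score >= threshold
        if predicted_case and label == 1:
            tp += 1
        elif predicted_case and label == 0:
            fp += 1
        elif not predicted_case and label == 1:
            fn += 1
    return tp, fp, fn


def ppv_and_f1(scores: List[float], labels: List[int],
               threshold: float) -> Tuple[float, float]:
    """PPV (precision) and F1 score at a given probability threshold."""
    tp, fp, fn = confusion_counts(scores, labels, threshold)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denominator = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denominator if denominator else 0.0
    return ppv, f1


def lowest_threshold_reaching_ppv(scores: List[float], labels: List[int],
                                  target_ppv: float) -> Optional[float]:
    """Assumed rule: scan observed scores from low to high and return
    the first threshold whose PPV meets the target, or None."""
    for threshold in sorted(set(scores)):
        ppv, _ = ppv_and_f1(scores, labels, threshold)
        if ppv >= target_ppv:
            return threshold
    return None


chosen = lowest_threshold_reaching_ppv(toy_scores, toy_labels, target_ppv=0.9)
print(chosen, ppv_and_f1(toy_scores, toy_labels, chosen))
```

Choosing the lowest qualifying threshold keeps as many true cases as possible while still meeting the precision target. In practice the threshold would be fixed on training data and then evaluated on held-out data, as the abstract describes.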

Highlights

Electronic health records (EHR) offer an interesting collection of clinical information for observational research, yet a crucial step is an accurate identification of disease cases.
Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm.
We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs.
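To make the two ranking metrics named in the highlights concrete, here is a minimal sketch in plain Python. AUROC is computed as the fraction of case/non-case pairs in which the case receives the higher score, and AUPRC is approximated by average precision, which is one common estimator and not necessarily the one used in the study. The toy scores and labels are invented for illustration.

```python
from typing import List

# Toy data, invented for illustration only (not from the study).
example_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
example_labels = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 = case, 0 = non-case


def auroc(scores: List[float], labels: List[int]) -> float:
    """Probability that a random case outscores a random non-case.
    Ties between a case and a non-case count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Mean of the precision values measured at the rank of each case.
    Records are ranked by descending score; tied scores keep input order."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_cases = sum(labels)
    if total_cases == 0:
        raise ValueError("need at least one case")
    cases_seen = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            cases_seen += 1
            precision_sum += cases_seen / rank
    return precision_sum / total_cases


print(auroc(example_scores, example_labels))
print(average_precision(example_scores, example_labels))
```

AUROC can stay high when cases are rare because it is insensitive to class balance, whereas precision-based measures drop quickly once false positives outnumber true cases. That is consistent with the word-matching baseline in the abstract scoring 0.90 on AUROC but only 0.33 on AUPRC.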

Summary

Introduction

Electronic health records (EHRs) offer a rich collection of clinical information for observational research, yet a crucial step is the accurate identification of disease cases. This is commonly done by manual chart review or by using standardized billing codes. Financial codes are often used to extract diagnoses from electronic health records, but this approach is prone to false positives. Clinical diagnoses can also be inferred by performing naïve word-matching on format-free text fields; this approach does not take the surrounding context into account and is prone to false positives as well. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries.
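A short sketch can illustrate why naïve word-matching ignores context. The matcher below simply checks whether a keyword occurs in a note; the function name, keyword list, and example notes are all hypothetical and are not taken from the study's data or code.

```python
import re
from typing import Iterable


def naive_word_match(note: str,
                     keywords: Iterable[str] = ("rheumatoid arthritis",)) -> bool:
    """Flag a note as a case if any keyword appears, ignoring all context."""
    text = note.lower()
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None
        for keyword in keywords
    )


# Hypothetical free-text notes, written for this illustration.
notes = [
    "Diagnosis: rheumatoid arthritis, started methotrexate.",
    "No evidence of rheumatoid arthritis; symptoms fit osteoarthritis.",
    "Mother has rheumatoid arthritis. Patient presents with gout.",
]

flags = [naive_word_match(note) for note in notes]
print(flags)
```

All three notes are flagged, although only the first describes a patient with the diagnosis. The second note negates it and the third refers to a family member, which are the kinds of context-dependent false positives a classifier trained on the full text may learn to avoid.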

Journal: JMIR Medical Informatics	Publication Date: Nov 30, 2020
Citations: 33	License type: cc-by
