Abstract
We aimed to develop and validate natural language processing (NLP)-assisted, machine learning (ML)-based classification models to confirm diagnoses of monoclonal gammopathy of undetermined significance (MGUS) and multiple myeloma (MM) from electronic health records (EHRs) in the Veterans Health Administration (VHA). We developed precompiled lexicons and classification rules as features for the following ML classifiers: logistic regression, random forest, and support vector machines (SVMs). The classifiers were trained on 36,044 EHR documents from a random sample of 400 patients with at least one International Classification of Diseases code for MGUS diagnosis from 1999 to 2021. The best-performing feature combination was calibrated in the validation set (17,826 documents/200 patients) and evaluated in the testing set (9,250 documents/100 patients). Model performance in diagnosis confirmation was compared with manual chart review results (the gold standard) using recall, precision, accuracy, and F1 score. For patients correctly labeled as disease-positive, the difference between the model-identified diagnosis date and the gold-standard date was also computed. In the testing set, the NLP-assisted classification model using SVMs achieved the best performance in both MGUS and MM confirmation, with recall/precision/accuracy/F1 of 98.8%/93.3%/93.0%/96.0% for MGUS and 100.0%/92.3%/99.0%/96.0% for MM. Diagnosis dates matched the gold standard within ±45 days in 73.0% of model-confirmed MGUS cases and 84.6% of model-confirmed MM cases. An NLP-assisted classification model can reliably confirm MGUS and MM diagnoses and dates and extract laboratory results using automated interpretation of EHR data. This algorithm has the potential to be adapted to other disease areas in the VHA EHR system.
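As an illustration of the evaluation described above, the following minimal Python sketch computes recall, precision, accuracy, and F1 from confusion-matrix counts and checks whether a model-identified diagnosis date falls within a ±45-day window of the gold-standard date. The function names (`confirmation_metrics`, `dates_match`) and the example counts are hypothetical; the abstract does not report a confusion matrix, and this is not the authors' code.

```python
from datetime import date


def confirmation_metrics(tp, fp, fn, tn):
    """Return recall, precision, accuracy, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are patient-level counts of model labels compared against
    manual chart review (the gold standard).
    """
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


def dates_match(model_date, gold_date, window_days=45):
    """True if the two dates differ by at most window_days (inclusive)."""
    return abs((model_date - gold_date).days) <= window_days


# Hypothetical counts for a 100-patient test set (not reported in the source).
example = confirmation_metrics(tp=12, fp=1, fn=0, tn=87)
for name, value in example.items():
    print(f"{name}: {value:.1%}")

# Hypothetical dates: exactly 45 days apart, so they count as a match.
print(dates_match(date(2020, 1, 1), date(2020, 2, 15)))
```

Treating the window as inclusive of day 45 is an assumption; the abstract states only "±45 days".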