Natural language processing of Veterans’ electronic health records to confirm diagnoses of monoclonal gammopathy of undetermined significance.

Mei Wang, Martin W Schoen, Theodore Seth Thomas, Graham A Colditz, Lawrence Liu, Yao-Chi Yu, Su-Hsin Chang

doi:10.1200/jco.2022.40.16_suppl.1557

Abstract

Abstract 1557

Background: The Veterans Health Administration (VHA) provides extensive electronic health records (EHRs) on Veterans nationwide. Our prior studies used VHA data to study the risk of progression from monoclonal gammopathy of undetermined significance (MGUS) to multiple myeloma. These studies relied on International Classification of Diseases (ICD) codes and manual abstraction of clinical notes to both identify and verify MGUS patients. Diagnosis confirmation is necessary because many providers enter a diagnosis in the clinical notes in order to order lab tests, and that diagnosis often remains in the EHR even after a negative test result. However, manual abstraction is labor intensive and time consuming. Building on advances in natural language processing (NLP), we developed a model to make MGUS confirmation more efficient.

Methods: We randomly selected 700 patients from among those in the VHA with an ICD-coded MGUS diagnosis from 1999 to 2021. A random sample of 500 of these patients was split into training (80%) and testing (20%) sets. The remainder (n = 200) served as the validation set. The records comprised 32,708 unstructured hematology/oncology Text Integration Utility reports and 9,237 lab reports (including 2,322 discrete results and 6,915 unstructured comments). All reports were manually reviewed to confirm MGUS diagnoses, and this review served as the reference standard. We compiled three lists of keywords: terms suggestive of an MGUS diagnosis, immunoglobulin subtypes, and negation modifiers. We trained a symbolic NLP model to identify diagnoses using combinations of these lists along with M-protein levels from lab reports. The combination that gave the highest recall and precision on the training set was selected and evaluated on the testing and validation sets.

Results: Among patients with ICD codes for MGUS, manual abstraction confirmed the MGUS diagnosis in 84% of patients in the testing set and 80% in the validation set. In the training set, our NLP model confirmed 75% of patients and achieved recall, precision, accuracy, and F1 score of 88.1%, 98.7%, 89.0%, and 93.1%, respectively; in the validation set, the model confirmed 76% of patients, and the recall, precision, accuracy, and F1 score were 89.4%, 94.7%, 87.5%, and 92.0%, respectively. On average, manual data abstraction took five minutes per patient (excluding data loading time), whereas the NLP model processed 13 patients per minute.

Conclusions: The NLP model developed to confirm MGUS diagnoses improves diagnostic accuracy compared with ICD codes alone. With performance similar to that of manual abstraction, the model is an efficient and viable method for confirming MGUS diagnoses. [Table: see text]
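
The approach described in Methods and the metrics reported in Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not publish the keyword lists or the optimized rule, so the three lists, the function names (`mentions`, `confirm_mgus`, `evaluate`), and the rule itself (a non-negated diagnosis mention plus either an immunoglobulin subtype mention or a measurable M-protein) are all assumptions made for illustration.

```python
import re

# Hypothetical keyword lists; the study's actual lists are not given in the abstract.
DIAGNOSIS_TERMS = ["mgus", "monoclonal gammopathy"]
IMMUNOGLOBULIN_TERMS = ["igg", "iga", "igm", "kappa", "lambda"]
NEGATION_TERMS = ["no evidence of", "negative for", "ruled out", "not consistent with"]


def mentions(text, terms):
    """Return True if any term occurs in text as a whole word or phrase."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms)


def confirm_mgus(notes, m_protein_g_dl=None):
    """Apply an assumed symbolic rule to one patient's notes and lab value.

    A patient is confirmed when at least one sentence mentions the diagnosis
    without a negation modifier AND there is supporting evidence: an
    immunoglobulin subtype mention or an M-protein level above zero.
    """
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for note in notes for s in re.split(r"(?<=[.!?])\s+", note) if s]
    asserted = any(
        mentions(s, DIAGNOSIS_TERMS) and not mentions(s, NEGATION_TERMS)
        for s in sentences
    )
    subtype_seen = any(mentions(s, IMMUNOGLOBULIN_TERMS) for s in sentences)
    m_protein_seen = m_protein_g_dl is not None and m_protein_g_dl > 0
    return asserted and (subtype_seen or m_protein_seen)


def evaluate(tp, fp, fn, tn):
    """Compute recall, precision, accuracy, and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}


# Example usage with made-up note text.
print(confirm_mgus(["IgG kappa MGUS, stable. Follow up in 6 months."]))
print(confirm_mgus(["No evidence of monoclonal gammopathy on SPEP."]))
print(evaluate(tp=143, fp=8, fn=17, tn=32))
```

The counts passed to `evaluate` are not reported in the abstract. They are illustrative values for a 200-patient set, chosen to be consistent with the reported validation percentages, and they yield recall, precision, accuracy, and F1 of approximately 89.4%, 94.7%, 87.5%, and 92.0%. A sentence-level negation check like this one would still miss cases such as "SPEP ordered to evaluate for MGUS", which is one reason a real rule set needs tuning against manually reviewed notes.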
