An automated EHR-based tool for identification of patients (pts) with metastatic disease to facilitate clinical trial pt ascertainment.

Jeffrey J Kirshner, James Hamrick, Steven Dunder, Madeline Richey, Caroline Nightingale, Evelyn Siu, Lauren Sutton, Janet Donegan, Zexi Chen, Peter Larson, Kelly Cohn, Karri Donahue

doi:10.1200/jco.2020.38.15_suppl.2051

Abstract

2051 Background: Efforts to facilitate patient identification for clinical trials in routine practice, such as automated electronic health record (EHR) data review, are hindered by the lack of structured information on metastatic status. We developed a machine learning tool that infers metastatic status from unstructured EHR data, and we describe its real-world implementation. Methods: The tool scans EHR documents and extracts features from text snippets surrounding key words (ie, ‘Metastatic’, ‘Progression’, ‘Local’). A regularized logistic regression model was trained and used to classify patients into 5 metastatic status inference categories: highly-likely positive, likely positive, likely negative, highly-likely negative, and unknown. Model accuracy was characterized using the Flatiron Health EHR-derived de-identified database of patients with solid tumors, with manually abstracted information serving as the reference standard. We assessed accuracy using sensitivity and specificity (patients in the ‘unknown’ category omitted from the numerator) and negative and positive predictive values (NPV, PPV; patients in the ‘unknown’ category included in the denominator), and evaluated performance in a real-world dataset. In a separate validation, we evaluated the accuracy gained from additional user review of the model outputs after the tool was integrated into workflows. Results: The metastatic status inference model was characterized using a sample of 66,532 patients. Model sensitivity and specificity (95% CI) were 82% (82, 83) and 95% (95, 96), respectively; PPV was 89% (89, 90) and NPV was 94% (94, 94). In the validation sample (N = 200, drawn from 5 distinct care sites), after user review of model outputs, values increased to 97% (85, 100) for sensitivity, 98% (95, 100) for specificity, 92% (78, 98) for PPV, and 99% (97, 100) for NPV.
The model assigned 163/200 patients to the highly-likely categories, which were deemed not to require further EHR review by users. The prevalence of errors was 4% without user review and 2% after user review. Conclusions: This machine learning model infers metastatic status from unstructured EHR data with high accuracy. The tool assigns metastatic status with high confidence in more than 75% of cases without requiring additional manual review, allowing more efficient identification of clinical trial candidates and clinical trial matching, and thus mitigating a key barrier to clinical trial participation in community clinics.
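
The two steps named in Methods, pulling text snippets around key words and mapping a model score to one of five inference categories, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 40-character window, and the probability thresholds (0.95 and 0.70) are hypothetical choices made only for illustration, and the feature extraction and regularized logistic regression that would sit between the two steps are omitted.

```python
import re

# Key words quoted in the abstract; whether the real tool uses others is not stated.
KEYWORDS = ("metastatic", "progression", "local")


def extract_snippets(text, keywords=KEYWORDS, window=40):
    """Return the text within `window` characters either side of each key word hit.

    The window size is a hypothetical parameter, not taken from the abstract.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end])
    return snippets


def inference_category(p_metastatic, high=0.95, low=0.70):
    """Map a predicted probability of metastatic disease to one of five categories.

    The thresholds `high` and `low` are illustrative assumptions only.
    """
    if p_metastatic >= high:
        return "highly-likely positive"
    if p_metastatic >= low:
        return "likely positive"
    if p_metastatic <= 1 - high:
        return "highly-likely negative"
    if p_metastatic <= 1 - low:
        return "likely negative"
    return "unknown"
```

Under this kind of scheme, only patients falling outside the two highly-likely categories would be routed to a user for review; in the validation sample that was 37 of 200 patients, since 163/200 (81.5%) were assigned to the highly-likely categories.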
