Esophageal cancer has an insidious presentation, with more than half of patients presenting with late-stage disease not amenable to potentially curative chemoradiotherapy or surgery. While screening high-risk individuals with endoscopy can provide a survival benefit, screening the general, unselected population is not currently recommended. This project used a machine learning approach to create an early prediction model drawing on the content of patients’ electronic health records (EHRs). We identified Medicare beneficiaries diagnosed with esophageal cancer between 2004 and 2013 and classified them into early-stage (n = 3,062) and late-stage (n = 2,359) esophageal cancer groups. Early-stage esophageal cancer cases were matched to a cohort of non-esophageal cancer controls in a 16:1 ratio, and this cohort was divided into training (70%) and test (30%) datasets. From the training cohort we constructed a prediction model to identify individuals with early-stage disease using an eXtreme Gradient Boosting (XGBoost) machine learning algorithm that incorporated features including patient demographics, procedure and clinical diagnosis codes, outpatient and PCP visit counts, and an inpatient visit indicator. Model performance was assessed with sensitivity, specificity, positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC), where an AUC of 1.0 indicates perfect prediction. The final predictive model to identify patients with early-stage esophageal cancer included 202 features, of which 127 (62.9%) were procedure codes, 67 (33.2%) were diagnosis codes, 6 (3.0%) were demographic features, and 2 (1.0%) were provider visit counts. The final predictive model had an AUC of 0.776. We evaluated model performance at varying classification thresholds and among different patient populations. When applied to patients over 65, a threshold with a sensitivity of 20.7% produced a specificity of 96.7% and a PPV of 0.068%.
Late-stage esophageal cancer patients identified by the model would have been detected a median of 24 months before their actual diagnosis, with three-quarters of these patients identified at least 9 months earlier than their recorded diagnosis date. Using EHR data to identify early-stage esophageal cancer patients shows promise, with potential for significantly earlier detection. While widespread use of this approach on an unselected population would produce high rates of false positives, this technique could be employed among high-risk patients or paired with other screening tools.
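The dependence of PPV on disease prevalence, which underlies the false-positive concern above, can be illustrated with a short Python sketch based on Bayes' rule. The sensitivity, specificity, and PPV values are those reported in the abstract. The abstract does not report a prevalence, so the value below is only a back-calculation from those three figures, and the 1% "high-risk" prevalence is a purely hypothetical number chosen for illustration. The function names are likewise illustrative and not part of the study's code.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def implied_prevalence(sensitivity: float, specificity: float, ppv_value: float) -> float:
    """Solve the PPV formula for prevalence, given the other three quantities."""
    fpr = 1.0 - specificity
    return ppv_value * fpr / (sensitivity * (1.0 - ppv_value) + ppv_value * fpr)


# Values reported in the abstract for patients over 65.
SENSITIVITY = 0.207
SPECIFICITY = 0.967
REPORTED_PPV = 0.00068  # 0.068%

# Back-calculated, NOT reported in the abstract.
prevalence = implied_prevalence(SENSITIVITY, SPECIFICITY, REPORTED_PPV)

# Hypothetical high-risk subgroup with 1% prevalence (assumed for illustration only).
high_risk_ppv = ppv(SENSITIVITY, SPECIFICITY, 0.01)

print(f"Implied prevalence: {prevalence:.6f}")
print(f"PPV at hypothetical 1% prevalence: {high_risk_ppv:.4f}")
```

Under these assumptions, the same threshold yields a much higher PPV in a subgroup with higher prevalence, which is consistent with the suggestion that the model may be better suited to high-risk patients than to an unselected population.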