Abstract

Pancreatic cancer (PaC) is often diagnosed at advanced stages, resulting in one of the lowest survival rates among patients with cancer. The purpose of this study was to investigate whether machine learning (ML) models can predict with high sensitivity and specificity an increased risk for PaC ahead of clinical diagnosis. Optum deidentified electronic health record (EHR) data set was used to extract 1-year data for each patient and to sample for PaC diagnosis, the number of interactions with the health care system, and unique demographic and clinical features. Data for patients with PaC diagnosis were collected between 1 and 2 years before the diagnosis. Standard binary classification ML models were used on training and testing data sets. Data analyses were performed using the scikit-learn package version 1.0.1. The data set consisted of 18,987 patient EHRs collected between December 31, 2007, and December 31, 2017. EHRs with 10 unique features and at least three health care interactions were used for model training (N = 15,189; n = 8,438 [56%] with PaC) and testing (N = 3,798; n = 2,127 [56%] with PaC). The ensemble model achieved an AUC of 0.89, a sensitivity of 85.61%, and a specificity of 76.18% on the testing data set and produced superior results compared with other binary classifiers. Increasing unique health care interactions to nine failed to improve the AUC score. When the testing data set was enlarged to 5,696 patients, the ensemble model achieved an AUC of 0.92 and a specificity of 93.21%, but the sensitivity was compromised. The ensemble model exceeded the state-of-the-art level of performance for prediction of PaC ahead of clinical diagnosis with a minimal clinically guided input, providing a potential strategy for selection of high-risk patients for further screening.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call