Abstract

394 Background: Previous work by our group has demonstrated that applying machine learning to diagnostic codes from Electronic Health Records (EHRs) can identify individuals at high risk for Pancreatic Ductal Adenocarcinoma (PDAC) as early as 1 year before cancer is currently diagnosed. We aim to improve the performance of our existing PDAC risk stratification model by using an independent, multi-center dataset and adding lab test features.

Methods: EHR data from TriNetX, a federated global health research network, were used to develop Logistic Regression (LR) models. Diagnosis and lab test data from 32 Health Care Organizations in the United States, covering 2015-2020, were used. PDAC patients aged 60-80 years were identified using ICD codes and cross-checked against tumor registry and pathology data to reduce false positives. Only patients with at least one clinical encounter at least 6 months prior to cancer diagnosis were included. Prediction time cutoffs of 180, 270, and 360 days before PDAC diagnosis were used. A preliminary data analysis was performed to explore lab test features that could improve model performance. The discriminatory capabilities of the LR models were compared using the Area Under the Receiver Operating Characteristic Curve (AUC), with 95% Confidence Intervals computed by empirical bootstrap over the test data. We used L2-regularized LR, evaluated the models using cross-validation, and report cross-validation performance. In contrast to prior published work that used predefined feature sets for model development, we incorporated a wide range of indicators and relied on regularization to address the risk of overfitting.

Results: The LR models were trained and evaluated on diagnoses and labs for 25,644 patients (cases = 1352; age- and sex-matched controls). Lab test administration per patient (i.e., for a given patient, which lab tests were administered and how frequently) was found to be the most valuable feature for improving discrimination. For almost every type of lab test, the average number of administrations per patient was higher for PDAC patients than for controls. The lab tests with the highest discriminatory coefficients included glucose, potassium, hematocrit, hemoglobin, sodium, chloride, and creatinine. With a 365-day lead time, the diagnoses-based LR obtained a test AUC of 0.58 and the lab-test-based LR obtained a test AUC of 0.72. The combined diagnoses and lab-test model (“concatenated LR model”) outperformed both, obtaining a test AUC of 0.73.

Conclusions: Our findings demonstrate that LR models based on concatenated lab test and diagnoses feature sets (“concatenated LR models”) can outperform both diagnoses-based and lab-test-based LR models, and can be used for early prediction of PDAC development.
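
The evaluation described in the Methods (AUC with a 95% Confidence Interval from an empirical bootstrap over test data) can be illustrated with a minimal sketch. This is not the authors' code: the function names `auc` and `empirical_bootstrap_auc_ci`, the number of resamples, and the synthetic labels and scores are all assumptions made for illustration. The abstract does not specify the exact bootstrap variant, so the sketch assumes the pivotal form, in which the quantiles of the resampled differences from the point estimate are reflected around that estimate.

```python
import random


def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def empirical_bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point AUC plus an empirical (pivotal) bootstrap confidence interval.

    n_boot, alpha and seed are illustrative defaults, not values from the abstract.
    """
    rng = random.Random(seed)
    n = len(labels)
    point = auc(labels, scores)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample test patients with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC is undefined unless both classes are present
            deltas.append(auc(ys, [scores[i] for i in idx]) - point)
    deltas.sort()
    low_q = deltas[int((alpha / 2) * (n_boot - 1))]
    high_q = deltas[int((1 - alpha / 2) * (n_boot - 1))]
    return point, point - high_q, point - low_q


# Synthetic illustration only: 50 hypothetical cases and 200 hypothetical controls.
data_rng = random.Random(42)
labels = [1] * 50 + [0] * 200
scores = [data_rng.gauss(0.8, 1.0) for _ in range(50)] + [
    data_rng.gauss(0.0, 1.0) for _ in range(200)
]
point, lower, upper = empirical_bootstrap_auc_ci(labels, scores)
print(f"AUC = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

In practice the scores would be the predicted probabilities of a fitted LR model on held-out patients. Resampling whole patients keeps each patient's label and score together, which is what makes the interval reflect sampling variability in the test set.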
