Abstract

The objective of this study was to generate an optimal predictive model that combines environmental toxicants with known individual and contextual predictors of colorectal cancer (CRC) mortality using a machine-learning (ML) framework. Five conventional supervised ML models were used to develop optimal combinations of individual and contextual covariates with environmental exposure variables. The methodology is based on classification and regression techniques to better understand the causal inference from individual- and contextual-level data, as well as environmental exposure data, on CRC mortality. The main outcome is the CRC patient's vital status (dead vs. alive), which is used to predict the likelihood of all-cause mortality. Data sources include the Sentara Cancer Registry, 2010 Census datasets, and the Environmental Protection Agency Toxics Release Inventory. Logistic regression was used to predict mortality alongside more recent supervised ML methods: Decision Tree, Forest Gradient Boosting, Neural Network, and Bayesian Network. The "best model" was chosen based on the area under the curve (AUC), the Kolmogorov-Smirnov (KS) statistic (also reported as the Youden index), and the root average squared error (RASE). The data linkage produced a dataset of 202 variables: 16 patient characteristic variables, 3 socio-economic status variables, and 183 chemical-related variables. Across the prediction algorithms, AUC ranged from 0.5 to 0.77, KS (Youden) from 0 to 0.4185, and RASE from 0.4409 to 0.4924. Based on these three statistics, the Forest Gradient Boosting algorithm appears to be the best predictive model for these data. That model includes 53 important variables, of which 35 were environmental covariates. The classical logistic regression retained only four of the 202 variables, none of them chemical-related.
This study contributes to the scientific innovation of leveraging mixed-methods data collection by building a comprehensive database for cancer research.
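
The three selection statistics named above can be illustrated with a minimal Python sketch. It is not the study's implementation: the labels, the predicted probabilities, and the two candidate models (`model_a`, `constant`) are hypothetical toy values, and the function names are chosen here for illustration. It assumes a binary outcome coded 1 = dead, 0 = alive, takes KS as the maximum gap between the true-positive and false-positive rates (the Youden index at the best threshold), and takes RASE as the square root of the mean squared difference between the outcome and the predicted probability.

```python
import math


def roc_points(y_true, y_score):
    """Return (FPR, TPR) points, sweeping the threshold over distinct scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Tied scores share one threshold, so they move the curve together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def ks_youden(points):
    """Largest TPR - FPR gap: the KS statistic, i.e. the maximal Youden index."""
    return max(tpr - fpr for fpr, tpr in points)


def rase(y_true, y_score):
    """Root average squared error between outcomes and predicted probabilities."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_score)) / n)


# Hypothetical toy data: 1 = dead, 0 = alive.
y = [1, 1, 0, 1, 0, 0]
candidates = {
    "model_a": [0.9, 0.8, 0.7, 0.6, 0.3, 0.2],   # an informative model
    "constant": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # an uninformative model
}

results = {}
for name, scores in candidates.items():
    pts = roc_points(y, scores)
    results[name] = {
        "auc": auc(pts),
        "ks": ks_youden(pts),
        "rase": rase(y, scores),
    }
    print(name, results[name])
```

A model that assigns every patient the same probability yields an AUC of 0.5 and a KS of 0, which matches the lower ends of the ranges reported above. Higher AUC and KS and lower RASE indicate a better model. How the study weighed the three statistics against one another is not stated, so the sketch only reports them.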
