Abstract

Background: The precision medicine initiative calls for the study of genes, behaviors, and environment to improve disease prevention. There is a growing body of research supporting the role of social environment (i.e., the neighborhood in which one lives) in cancer health disparities. However, recent efforts have focused on applying empiric, high-dimensional computing approaches to genetic data, with less emphasis on environment. In this study, we adapted and applied empiric machine learning approaches to identify which method would be most effective at evaluating the effects of social environment on advanced prostate cancer in a simulated dataset. As is common in high-dimensional data, we encountered (and will present) statistical challenges that arose during analysis, specifically related to multicollinearity.

Methods: Pennsylvania Prostate Cancer Registry data from 1995-2005 were linked to publicly available social environmental data from the 2000 U.S. Census via a geocode at the census tract level using ArcGIS software. These primary data consisted of 86,629 prostate cancer cases and 14,663 census variables. U.S. Census variables, which are defined in terms of neighborhood socioeconomic variables such as education, income, and employment, are known to be highly correlated. A simulated dataset was created using the data structure of our primary dataset, in which 10 prespecified variables were independent predictors of a binary outcome and the remaining 990 variables had no effect. Test and training sets were created, and various machine learning approaches were applied and compared: standard regression models (REG), Lasso penalized regression (LASSO), elastic net regression (ELNET), and random forest (RF). The method that was most successful at identifying "true" variables (or highly correlated surrogates), limiting false-positive results, and consistently replicating findings was considered the most effective approach. Simulations were repeated 500 times, and the results were summarized.

Results: Over the 500 simulations, the methods identified on average 6.3 (REG), 6.4 (LASSO), 8.2 (ELNET), and 10 (RF) of the 10 true (or highly correlated surrogate) variables. In addition, 38.8 (REG), 13.3 (LASSO), 49.9 (ELNET), and 65 (RF) false-positive variables were identified. RF consistently replicated the selection of all 10 variables across simulations 100% of the time, whereas LASSO was consistently unable to identify 2 of the 10 true variables.

Conclusions: Preliminary findings suggest a combination of RF and LASSO may be the most effective approach; LASSO has the best overall ability to identify true variables while avoiding false positives; RF identifies true variables consistently. Given that LASSO was unable to detect 2 of the true variables, we will also present findings from multivariate models to allow for adjustment due to residual confounding. Final results should be tested in a real data setting where additional considerations for multicollinearity would need to be explored.

Citation Format: Shannon M. Lynch, Yinuo Yin, Elizabeth Handorf. Applying machine learning approaches to social environmental data from the U.S. Census in cancer studies: Challenges and considerations [abstract]. In: Proceedings of the AACR Special Conference on Modernizing Population Sciences in the Digital Age; 2019 Feb 19-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2020;29(9 Suppl):Abstract nr A03.
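The evaluation criterion described in the Methods (crediting a method for each "true" variable it selects, or for a highly correlated surrogate, and counting every other selection as a false positive) can be illustrated with a short Python sketch. This is not the authors' code. The function names `score_selection` and `toy_columns`, the correlation cutoff of 0.8 for "highly correlated", and the four-variable toy dataset are all assumptions made for illustration; the abstract does not state how surrogates were defined.

```python
import random
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def score_selection(selected, true_vars, columns, surrogate_r=0.8):
    """Return (true variables recovered, false positives) for one selection.

    A selected variable that is not itself true still credits any true
    variable it is correlated with at |r| >= surrogate_r (assumed cutoff).
    Selected variables matching no true variable count as false positives.
    """
    recovered, false_positives = set(), set()
    for j in selected:
        if j in true_vars:
            recovered.add(j)
            continue
        matches = {
            t for t in true_vars
            if abs(pearson(columns[j], columns[t])) >= surrogate_r
        }
        if matches:
            recovered |= matches
        else:
            false_positives.add(j)
    return len(recovered), len(false_positives)


def toy_columns(n=500, seed=0):
    """Hypothetical data: column 1 is a near-copy of column 0."""
    rng = random.Random(seed)
    x0 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [v + rng.gauss(0, 0.1) for v in x0]  # surrogate for column 0
    x2 = [rng.gauss(0, 1) for _ in range(n)]  # independent
    x3 = [rng.gauss(0, 1) for _ in range(n)]  # independent noise variable
    return [x0, x1, x2, x3]


columns = toy_columns()
true_vars = {0, 2}  # assumed "true" predictors in the toy example
hits, false_pos = score_selection({1, 3}, true_vars, columns)
print(hits, false_pos)
```

In this toy case a method that selects columns 1 and 3 is credited with recovering true variable 0 through its surrogate and charged one false positive for column 3. Averaging these two counts over repeated simulated datasets would give per-method summaries of the kind reported in the Results.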