Abstract

Background: Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but the full dataset is rarely used for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.

Methods: We compared several popular machine learning methods, including penalized regressions (e.g., lasso, elastic net) and tree ensemble methods. Via simulation, we assessed the methods’ ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.

Results: In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman’s correlation with sparse group lasso regression performed the best overall. Bayesian Additive Regression Trees outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.

Conclusions: This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true positive variables while controlling false positive discoveries.
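
As a rough illustration of the clustering step named in the Results (grouping variables by Spearman's correlation before a group-wise penalized regression), the following is a minimal sketch and not the authors' implementation. It assumes a distance of 1 - |rho|, single-linkage clustering cut at a fixed height, and made-up toy tract values; the function names, the `cut_height` parameter, and the variable names are hypothetical. Cutting a single-linkage tree at a given height is equivalent to taking connected components of the graph that links every pair of variables within that distance, which is what the union-find code does.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_groups(columns, cut_height=0.3):
    """Group variables whose distance 1 - |rho| is within cut_height.

    Assumption: single linkage, implemented as connected components.
    """
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if 1 - abs(spearman(columns[a], columns[b])) <= cut_height:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Toy, invented values for six hypothetical census tracts.
tracts = {
    "median_income": [31, 45, 52, 60, 75, 88],
    "pct_poverty": [30, 22, 18, 15, 9, 4],
    "pct_college": [12, 20, 19, 33, 41, 50],
    "pct_renters": [40, 35, 62, 20, 55, 38],
}
print(correlation_groups(tracts))
```

In this toy input the income, poverty, and college variables fall into one group and the renter variable stays on its own; a group-wise penalty such as the sparse group lasso would then act on those groups.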

Highlights

  • Social-environmental data obtained from the United States (US) Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis

  • The number of false positives was substantially reduced under the relaxed definition, especially for methods which identify groups of correlated predictors, demonstrating that many of the “false positive” results were identified due to their relationship with a “true positive” variable

  • In simulation studies, we found that methods using hierarchical clustering combined with sparse group lasso (HCLST-CORR-Sparse Group Lasso (SGL)) performed the best at identifying variables with true associations, while providing control of false positive results
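
The strict and relaxed false-positive counts mentioned in the highlights can be sketched as follows. This is only an illustration under an assumption: the excerpt does not spell out the relaxed definition, so the sketch assumes that a selected variable known to be correlated with a true variable is not counted as a false positive. The function name, argument names, and variable labels are hypothetical.

```python
def count_discoveries(selected, truly_associated, correlated_with_truth=()):
    """Return (true_positives, false_positives) for a set of selected variables.

    Strict definition: every selected variable outside the true set is a
    false positive. Relaxed definition (assumed here): selected variables
    listed in correlated_with_truth are excluded from the false positives.
    """
    selected = set(selected)
    truth = set(truly_associated)
    proxies = set(correlated_with_truth)
    true_positives = selected & truth
    false_positives = selected - truth - proxies
    return len(true_positives), len(false_positives)


# Hypothetical simulation result: three true variables, four selected.
selected = {"x1", "x2", "x7", "x9"}
truth = {"x1", "x2", "x3"}
proxies = {"x7"}  # assumed to be strongly correlated with a true variable

strict = count_discoveries(selected, truth)
relaxed = count_discoveries(selected, truth, proxies)
print("strict:", strict)
print("relaxed:", relaxed)
```

With these made-up inputs the true-positive count is unchanged, while the false-positive count drops by one under the relaxed definition because `x7` is treated as a proxy for a true variable.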

Introduction

The social environment, as defined by a patient’s neighborhood of residence, is relevant to the study of cancer health disparities. Neighborhood boundaries can be defined by US Census tracts (smaller geographic areas than a county). These neighborhoods can be described by variables measuring economic (e.g., employment, income), physical (e.g., housing/transportation structure), and social (e.g., poverty, education) characteristics [4, 5]. Studies linking US Census data with state and national cancer registry data show that neighborhood can help explain differential cancer incidence and mortality rates beyond race/ethnicity or genetic ancestry, and that neighborhood environment often exerts independent effects on cancer outcomes [6, 7].
