Abstract

Research Objective

Health services researchers' use of social determinants of health (SDOH) variables in quantitative models is increasing, and many publicly available data sources contain scores of high-quality, complete SDOH variables. However, deciding which of the available SDOH variables are most important to include makes variable selection challenging. One approach is to rely on a conceptual framework, prior research, and intuition. However, conceptual framework domains often broadly describe “external context” or “community factors,” which provide little help in identifying specific variables to use. Data science methods, particularly random forest regression, offer a potential data-driven approach to SDOH variable selection. This study compared a qualitative approach and a data-driven approach to SDOH variable selection to identify key SDOH predictors of county-level health outcomes.

Study Design

We constructed an initial dataset of county-level SDOH variables compiled from the following data sources: Area Health Resources File, County Health Rankings, American Community Survey, Picture of Subsidized Households, Penn State University’s Social Capital Index, and the Food Environment Atlas. We then employed a qualitative variable selection approach using the Healthy People 2020 organizing framework for SDOH. We purposively selected 6 variables that together touched on all 5 domains of the framework, had sufficient variation across counties, were relatively normally distributed, and had established associations with health outcomes in the literature. Next, we employed a data-driven variable selection approach using random forest regression. We fit 3 random forest regression models, each with a different county-level health outcome specified, and determined the top 6 SDOH predictors driving each outcome. The outcomes were premature death (days of life lost), proportion of the population reporting fair or poor health, and preventable hospitalization rate (ambulatory care sensitive conditions). We identified overlap among the top 6 SDOH predictors from each random forest model to determine the final set of variables for the data-driven approach. We then compared the SDOH variables determined using the data-driven approach to those selected using the qualitative approach.

Population Studied

We included all 3142 U.S. counties in the analysis, and our dataset contained 81 SDOH variables.

Principal Findings

We selected the following SDOH variables using the qualitative approach: median household income, poverty rate, primary care physician-to-population ratio, social deprivation index, food environment index, and proportion of the population that reports severe housing problems. The following SDOH variables were selected using the data-driven approach: median household income (3 models), poverty rate (2 models), proportion of the population with some college (2 models), proportion of the population that reports excessive drinking (2 models), proportion of the population that identifies as American Indian or Alaska Native (2 models), and social capital index (2 models). Two of the 6 variables selected using the qualitative approach (median household income and poverty rate) were validated by the data-driven approach.

Conclusions

Random forest models can assist with SDOH variable selection for quantitative analysis. However, variables selected using these techniques may not align well with those selected using qualitative approaches.

Implications for Policy or Practice

Researchers should consider using data science approaches to validate and complement, rather than supplant, qualitative approaches to variable selection.
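The data-driven selection procedure described under Study Design (one random forest per outcome, the top 6 predictors from each, then the overlap across models) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not name the software or the importance measure, so the use of scikit-learn's impurity-based `feature_importances_`, the function names, the `min_models=2` overlap threshold, and the toy importance values are all assumptions made here for illustration.

```python
from collections import Counter


def fit_importances(X, y, feature_names, seed=0):
    """Fit one random forest for a single outcome; return {variable: importance}.

    Assumption: impurity-based importances from scikit-learn. The abstract
    does not state which importance measure the authors used.
    """
    # Imported lazily so the pure-Python helpers below run without scikit-learn.
    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(n_estimators=500, random_state=seed)
    model.fit(X, y)
    return dict(zip(feature_names, model.feature_importances_))


def top_k(importances, k=6):
    """Return the k variable names with the largest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def overlapping_variables(top_by_outcome, min_models=2):
    """Count how many outcome models selected each variable and keep those
    chosen by at least `min_models` models (the threshold is an assumption)."""
    counts = Counter(name for names in top_by_outcome.values() for name in names)
    return {name: n for name, n in counts.items() if n >= min_models}


# Toy importances (made up, not study results), using k=2 to keep it short.
toy = {
    "premature_death": {"income": 0.40, "poverty": 0.30, "college": 0.20, "drinking": 0.10},
    "fair_poor_health": {"income": 0.40, "poverty": 0.30, "drinking": 0.25, "college": 0.05},
    "preventable_hosp": {"income": 0.50, "social_capital": 0.30, "college": 0.15, "poverty": 0.05},
}
top_by_outcome = {outcome: top_k(imp, k=2) for outcome, imp in toy.items()}
print(overlapping_variables(top_by_outcome))  # {'income': 3, 'poverty': 2}
```

With real data, `fit_importances` would be called once per outcome on the county-by-variable matrix, and `top_k(..., k=6)` would replace the toy dictionaries. Impurity-based importances can favor high-cardinality or correlated predictors, so permutation importance is a common alternative when rankings need to be compared across models.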

