Abstract

Background
Accumulating evidence suggests that human health is affected by a complex set of exposures, including environmental toxicants, dietary constituents, psychosocial stressors, and physical factors. Capturing the complexity of the whole exposome is pivotal for advancing etiological knowledge, yet standard statistical methods, as well as recent developments for mixtures analysis, can consider only a modest number of exposures and interactions and often implicitly assume that the exposures are continuous. Machine learning approaches could offer benefits in these very high-dimensional settings.

Method
Using gradient boosted decision trees coupled with Bayesian model optimization and a nested cross-validation design, we describe an approach for predicting an outcome from a mixture of exposures and for identifying specific culprit factors within the mixture that consistently drive the association, while allowing for synergistic or antagonistic interactions between predictors. Although this flexible approach is applicable to many settings, we used it to evaluate the association between patients’ history of medication use and the risk of amyotrophic lateral sclerosis (ALS), as a stepping stone for integrating additional exposures and in a setting that largely avoids the exposure measurement errors that further complicate many toxicant mixtures.

Results
Of nearly 800 binary predictors, we identified 7 medication classes that were consistently associated with ALS risk across independently trained models. Interactions between medication groups did not substantially affect the risk. Prediction accuracy was consistent but low, owing to the lack of information on other etiological risk factors for ALS.

Summary
The described methodology makes it possible to predict the overall effect of a mixture and to identify specific culprit factors in very high-dimensional settings and when both continuous and categorical exposures are of interest. While causal interpretation of purely predictive models should generally be avoided, the repeated sampling of the dataset has interesting implications for causal inference.
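To make the Method description more concrete, the following is a minimal sketch of the general kind of pipeline it refers to: gradient boosted decision trees tuned by Bayesian hyperparameter optimization inside a nested cross-validation design. It is written in Python and assumes scikit-learn, XGBoost, and scikit-optimize as stand-in libraries; the variable names, parameter ranges, and synthetic data are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch only: gradient boosted trees + Bayesian hyperparameter search
# nested inside an outer cross-validation loop. All names and ranges
# below are illustrative, not taken from the original study.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

# Synthetic stand-in data: ~800 binary exposure indicators, binary outcome.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 800))
y = rng.integers(0, 2, size=500)

# Inner loop: Bayesian search over the boosting hyperparameters.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = BayesSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    search_spaces={
        "max_depth": Integer(2, 8),
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
        "n_estimators": Integer(50, 300),
        "subsample": Real(0.5, 1.0),
    },
    n_iter=20,
    cv=inner_cv,
    scoring="roc_auc",
    random_state=0,
)

# Outer loop: performance is estimated on folds never used for tuning.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

In a setup like this, candidate culprit factors could, for example, be sought by refitting the tuned model on each outer training fold and comparing which predictors are consistently ranked highly in the feature importances across folds, mirroring the abstract's emphasis on associations that persist across independently trained models.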
