Abstract

The risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, because the search space of all possible interactions is high-dimensional, many genetic studies assess the association between disease status and markers one single-nucleotide polymorphism (SNP) at a time. Three ensemble methods have recently been proposed for use with high-dimensional data: Monte Carlo logic regression, random forests, and generalized boosted regression. An intuitive way to detect an association between genetic markers and disease status is to use variable importance measures, even though the stability of these measures in the context of a whole-genome association study is unknown. For the simulated data of Problem 3 in the Genetic Analysis Workshop 15 (GAW15), we examined the variability of both the rankings and the magnitudes of variable importance measures using 10 variables simulated to participate in gene × gene and gene × environment interactions. We conducted 500 analyses per method on one randomly selected replicate, tallying the rankings and importance measures for each of the 10 variables of interest. When the simulated effect size was strong, all three methods showed stable rankings and estimates of variable importance. However, under conditions more commonly expected in complex diseases, random forests and generalized boosted regression showed more stable estimates of variable importance and variable rankings than Monte Carlo logic regression. Investigators applying statistical learning methods to detect interactions in complex disease studies should perform repeated analyses to ensure that variable importance measures and rankings do not vary greatly, even for statistical learning algorithms that are thought to be stable.
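The repeated-analysis design described above (many runs on one replicate, tallying each variable's rank and importance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the toy data, the effect sizes, and the function names (`simulate`, `importance`, `repeated_analyses`, `summarize`) are all hypothetical. Bootstrap resampling stands in for the run-to-run randomness of a stochastic learning algorithm, and a simple difference in case rates stands in for the importance measures of the three methods.

```python
import random
import statistics

def simulate(n, n_vars, rng):
    """Toy data (assumption): binary predictors; variable 0 has a strong
    effect on the outcome, variable 1 a weak one, the rest are noise."""
    X = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(n)]
    y = []
    for row in X:
        p = 0.2 + 0.5 * row[0] + 0.1 * row[1]
        y.append(1 if rng.random() < p else 0)
    return X, y

def importance(X, y, j):
    """Stand-in importance: absolute difference in case rate between
    the two levels of variable j."""
    ones = [yi for row, yi in zip(X, y) if row[j] == 1]
    zeros = [yi for row, yi in zip(X, y) if row[j] == 0]
    if not ones or not zeros:
        return 0.0
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

def repeated_analyses(X, y, n_runs, rng):
    """Repeat the analysis n_runs times, recording each variable's rank
    (1 = most important) and importance score in every run."""
    n, n_vars = len(X), len(X[0])
    ranks = [[] for _ in range(n_vars)]
    scores = [[] for _ in range(n_vars)]
    for _ in range(n_runs):
        # Bootstrap resample: a proxy for the algorithm's own randomness.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        imp = [importance(Xb, yb, j) for j in range(n_vars)]
        order = sorted(range(n_vars), key=lambda j: -imp[j])
        for rank, j in enumerate(order, start=1):
            ranks[j].append(rank)
            scores[j].append(imp[j])
    return ranks, scores

def summarize(ranks, scores):
    """Per-variable spread of ranks and importance across runs."""
    return [
        {
            "variable": j,
            "min_rank": min(ranks[j]),
            "max_rank": max(ranks[j]),
            "mean_importance": statistics.mean(scores[j]),
            "sd_importance": statistics.stdev(scores[j]),
        }
        for j in range(len(ranks))
    ]

rng = random.Random(15)
X, y = simulate(n=500, n_vars=10, rng=rng)
ranks, scores = repeated_analyses(X, y, n_runs=500, rng=rng)
for row in summarize(ranks, scores):
    print(row)
```

In a summary like this, a variable whose minimum and maximum rank coincide is stably ranked, whereas a wide rank range or a large standard deviation of importance suggests that a single run could be misleading.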

Highlights

  • The use of statistical learning methods to detect interactions between genetic and environmental risk factors is fuelled by the need for methods developed for high-dimensional data (e.g., whole-genome association studies (WGAs)), in which explicitly considering all possible two-way, three-way, or higher-order interactions is not computationally feasible

  • Using the Genetic Analysis Workshop 15 (GAW15) data simulated to mimic a genome-wide association study of rheumatoid arthritis (RA), we tested three statistical learning tools to assess variability in rankings and importance scores within each method on variables simulated as participating in gene × gene or gene × environment interaction

  • With regard to the smaller effect sizes, generalized boosted regression (GBM) and random forests (RF) seemed less variable in rankings and importance scores than Monte Carlo logic regression (MCLR). GBM was also superior to RF in the stability of rankings and importance measures, which is not surprising because boosting is an averaging process across an additive expansion of trees [13]

Summary

Introduction

The use of statistical learning methods to detect interactions between genetic and environmental risk factors is fuelled by the need for methods developed for high-dimensional data (e.g., whole-genome association studies (WGAs)), in which explicitly considering all possible two-way, three-way, or higher-order interactions is not computationally feasible. Rather than selecting a single 'best-fitting' logic tree, Monte Carlo logic regression extends logic regression by tallying the single variables and higher-order variable interactions investigated during the run of a homogeneous Markov chain; these tallies can be considered a measure of variable importance [3]. The random forest approach is a classification-tree method that creates an ensemble of trees, each generated from a bootstrap sample of the data and randomly selected subsets of the predictors. This "random forest" of trees uses a consensus vote to predict the outcome. The contribution of each predictor to the reduction in deviance was used as a measure of variable importance.
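The three ingredients of the random forest approach named above (bootstrap samples, randomly selected predictor subsets, and a consensus vote) can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions rather than a reference implementation: predictors are binary, splits are chosen by Gini impurity, and the toy outcome is a noise-free gene × gene interaction (`x0 AND x1`). All function names and parameters (`grow_tree`, `grow_forest`, `mtry`, and so on) are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def majority(labels):
    """Most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def grow_tree(X, y, depth, mtry, rng):
    """Grow a small tree on binary predictors, considering only `mtry`
    randomly selected predictors at each node. A leaf is a label; an
    internal node is a tuple (variable, subtree_if_0, subtree_if_1)."""
    if depth == 0 or len(set(y)) == 1:
        return majority(y)
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        left = [yi for row, yi in zip(X, y) if row[j] == 0]
        right = [yi for row, yi in zip(X, y) if row[j] == 1]
        if not left or not right:
            continue  # this predictor does not split the node
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j)
    if best is None:
        return majority(y)
    j = best[1]
    lo = [(row, yi) for row, yi in zip(X, y) if row[j] == 0]
    hi = [(row, yi) for row, yi in zip(X, y) if row[j] == 1]
    return (
        j,
        grow_tree([r for r, _ in lo], [t for _, t in lo], depth - 1, mtry, rng),
        grow_tree([r for r, _ in hi], [t for _, t in hi], depth - 1, mtry, rng),
    )

def predict_tree(tree, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(tree, tuple):
        j, if_zero, if_one = tree
        tree = if_one if row[j] == 1 else if_zero
    return tree

def grow_forest(X, y, n_trees, depth, mtry, rng):
    """Each tree is grown on its own bootstrap sample of the data."""
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(
            grow_tree([X[i] for i in idx], [y[i] for i in idx], depth, mtry, rng)
        )
    return forest

def predict_forest(forest, row):
    """Consensus (majority) vote across all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data (assumption): the outcome is a pure two-way interaction.
rng = random.Random(3)
X = [[rng.randint(0, 1) for _ in range(4)] for _ in range(200)]
y = [int(row[0] == 1 and row[1] == 1) for row in X]

forest = grow_forest(X, y, n_trees=51, depth=6, mtry=2, rng=rng)
accuracy = sum(predict_forest(forest, r) == yi for r, yi in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

Individual trees in this sketch can miss the interaction when the random predictor subsets never offer both interacting variables along a path; the consensus vote is what smooths over those misses.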

