SNP interaction detection with Random Forests in high-dimensional genetic data

Stacey J Winham, Marianne Huebner, Robert R Freimuth, Joanna M Biernacka, Xin Wang, Mariza De Andrade, Colin L Colby

doi:10.1186/1471-2105-13-164

Open Access

Abstract

Background: Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data become increasingly high-dimensional.

Results: RF effectively identifies interactions in low-dimensional data. As the total number of predictor variables increases, the probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than the effects of interactions.

Conclusions: While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.

Highlights

Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies
In this study, we investigate the ability of Random Forests to detect both marginal and interacting effects in high-dimensional data, in order to validate the claim that RF methods are well suited to describe gene-gene interactions and to determine their usefulness as filter methods or screening tools that allow for interaction effects in large datasets, assuming sample sizes and genetic effect sizes likely to be encountered in real data analysis
The use of alternative definitions of single nucleotide polymorphism (SNP) detection and detection probability could impact the findings of this study; we found that a previous definition of power utilized by Bureau et al. [20] is similar to our definition in practice and provided similar results

Summary

Introduction

Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies (GWAS). Complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Risk SNPs identified so far can explain only a small percentage of the estimated heritability of such traits. This may be partly because the single-SNP analysis strategies commonly employed in GWAS are designed to detect common variants with strong marginal associations and are not suitable for detecting complex multigenic disease risk factors, which may account for some of the missing heritability [3,4,5]. It is believed that gene-gene interaction effects, or conditional dependence between genetic variants affecting the phenotype, contribute to complex traits. Ignoring those interactions in univariate analyses may be limiting the success of GWAS for complex diseases [2,7].
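
To make concrete why a single-SNP analysis can miss an interaction, the following minimal sketch computes marginal disease probabilities from a two-locus penetrance table. The table, the minor allele frequency of 0.5, and the assumption that the two SNPs are independent and in Hardy-Weinberg equilibrium are all hypothetical choices made for illustration; they are not the simulation models used in this study.

```python
# Genotypes are coded 0, 1, 2 (number of minor alleles).
# Hypothetical setting: minor allele frequency 0.5 at both SNPs,
# Hardy-Weinberg equilibrium, and independence between the two SNPs.
MAF = 0.5
GENO_FREQ = {0: (1 - MAF) ** 2, 1: 2 * MAF * (1 - MAF), 2: MAF ** 2}

# Hypothetical penetrance table: P(disease | genotype at SNP A, genotype at SNP B).
# Disease risk depends on the combination of genotypes, not on either SNP alone.
PENETRANCE = {
    (0, 0): 0.0, (0, 1): 0.1, (0, 2): 0.0,
    (1, 0): 0.1, (1, 1): 0.0, (1, 2): 0.1,
    (2, 0): 0.0, (2, 1): 0.1, (2, 2): 0.0,
}


def marginal_penetrance(table, freq, locus):
    """Average the two-locus penetrance over the genotypes of the other SNP.

    locus=0 returns P(disease | genotype at SNP A); locus=1 does the same for SNP B.
    """
    marginal = {}
    for g in freq:
        total = 0.0
        for h, p_h in freq.items():
            key = (g, h) if locus == 0 else (h, g)
            total += p_h * table[key]
        marginal[g] = total
    return marginal


marg_a = marginal_penetrance(PENETRANCE, GENO_FREQ, locus=0)
marg_b = marginal_penetrance(PENETRANCE, GENO_FREQ, locus=1)

print("P(disease | SNP A genotype):", marg_a)
print("P(disease | SNP B genotype):", marg_b)
print("joint penetrance range:", min(PENETRANCE.values()), "to", max(PENETRANCE.values()))
```

Under these assumptions every genotype at either SNP carries the same marginal disease probability (0.05), so a test that examines one SNP at a time has no signal to find, even though the joint penetrance ranges from 0 to 0.1. This is the extreme case of an interaction with no marginal component.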

Methods

Results

Discussion

Conclusion