Abstract

Background

Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies, where predictor correlation is frequently observed. Recent works on the permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results.

Results

When predictor correlation was present and predictors were associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors relative to the unconditional VIMs under HA and were unbiased under H0. Scaled VIMs were clearly biased under both HA and H0.

Conclusions

Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors should be considered a "bias" (because they do not directly reflect the coefficients in the generating model) or a beneficial attribute of these VIMs depends on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.

Highlights

  • Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed

  • We studied the impact of correlated predictors on the variable importance measures generated by the two algorithms, RF and conditional inference forest (CIF): unscaled, unconditional permutation-based VIMs (RF and CIF), scaled permutation-based VIMs (RF) and conditional permutation-based VIMs (CIF)

  • In what follows, the results of RF VIM and estimated coefficients of bivariate and multiple linear regression models are compared to the coefficients that were used to generate the data by means of a multiple linear regression model
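
The comparison described in the last highlight can be illustrated with a minimal sketch. It is not the authors' simulation design: the sample size, the correlation `rho`, the coefficients and all function names below are assumptions chosen for illustration. Data are generated from a multiple linear regression model with two correlated predictors, and the bivariate (one predictor at a time) slope is compared with the multiple regression coefficients. A predictor whose generating coefficient is zero still shows a bivariate slope close to `rho * beta1` when it is correlated with an influential predictor, whereas its multiple regression coefficient stays close to zero.

```python
import random


def simulate(n, rho, beta1, beta2, seed=0):
    """Draw (x1, x2, y) with corr(x1, x2) = rho and y = beta1*x1 + beta2*x2 + noise.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    x1, x2, y = [], [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a = z1
        b = rho * z1 + (1 - rho ** 2) ** 0.5 * z2  # unit variance, corr rho with a
        x1.append(a)
        x2.append(b)
        y.append(beta1 * a + beta2 * b + rng.gauss(0, 1))
    return x1, x2, y


def cov(u, v):
    """Sample covariance of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def bivariate_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    return cov(x, y) / cov(x, x)


def multiple_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 jointly (2x2 normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


if __name__ == "__main__":
    # x2 has no effect of its own (beta2 = 0) but is strongly correlated with x1.
    x1, x2, y = simulate(n=20000, rho=0.9, beta1=1.0, beta2=0.0)
    print("bivariate slope of x2:", round(bivariate_slope(x2, y), 3))
    print("multiple regression slopes:", [round(b, 3) for b in multiple_slopes(x1, x2, y)])
```

In this toy setting the bivariate analysis and the multiple regression answer different questions, which is the same distinction the abstract draws between importance shared among correlated predictors and importance that mirrors the generating coefficients.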

Summary

Introduction

Random forest (RF) [1] and related methods such as conditional inference forest (CIF) [2] are tree-building methods that have proved increasingly successful in bioinformatics applications. This is especially true in statistical genetics, microarray analysis and the broad and rapidly expanding area of -omics studies. Nicodemus and Malley [4] reported that RF prefers uncorrelated predictors over all splits performed in building all trees in the forest, under both H0 and the alternative hypothesis HA (unless the effect size is large, e.g., an odds ratio of 5.0), because the splitting rule is based on the Gini Index. They further reported that, under H0, unconditional permutation-based VIMs are unbiased under within-predictor correlation for both RF and CIF, whereas Gini Index-based VIMs in RF are biased. While identifying only those predictors associated with the response was found to be more difficult in the presence of predictor correlation, identifying sets of predictors that are both associated with the response and correlated with other predictors might be useful, e.g., in genome-wide association studies, where strong linkage disequilibrium (LD) may be present between physically proximal genetic markers.
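
The mechanics of an unconditional, unscaled permutation-based VIM can be sketched in a few lines. The following is a toy illustration, not the RF or CIF implementation studied in the paper: it assumes a hypothetical ensemble of depth-1 trees (stumps), each grown on a bootstrap sample using one randomly chosen predictor and a median split, and all names (`make_data`, `fit_stump`, `permutation_vim`) and parameter values are invented for this example. For each tree, the out-of-bag (OOB) mean squared error is computed before and after permuting a predictor, and the differences are averaged over all trees.

```python
import random
import statistics


def make_data(n, rho, betas, rng):
    """x1 and x2 are correlated (rho), x3 is independent; y is linear plus noise."""
    X, y = [], []
    for _ in range(n):
        z1, z2, z3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        row = [z1, rho * z1 + (1 - rho ** 2) ** 0.5 * z2, z3]
        X.append(row)
        y.append(sum(b * v for b, v in zip(betas, row)) + rng.gauss(0, 1))
    return X, y


def fit_stump(X, y, idx, j):
    """Toy depth-1 tree: split predictor j at its median in the bootstrap sample."""
    thr = statistics.median([X[i][j] for i in idx])
    left = [y[i] for i in idx if X[i][j] <= thr]
    right = [y[i] for i in idx if X[i][j] > thr]
    overall = statistics.fmean([y[i] for i in idx])
    return (thr,
            statistics.fmean(left) if left else overall,
            statistics.fmean(right) if right else overall)


def oob_mse(stump, values, targets):
    """Mean squared error of the stump given predictor values and responses."""
    thr, left, right = stump
    return statistics.fmean(
        [(t - (left if v <= thr else right)) ** 2 for v, t in zip(values, targets)]
    )


def permutation_vim(X, y, n_trees, rng):
    """Unscaled, unconditional permutation importance averaged over all trees."""
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        j = rng.randrange(p)  # assumption: one candidate predictor per tree
        stump = fit_stump(X, y, boot, j)
        values = [X[i][j] for i in oob]
        targets = [y[i] for i in oob]
        base = oob_mse(stump, values, targets)
        shuffled = values[:]
        rng.shuffle(shuffled)  # permuting breaks the link between predictor and response
        # Permuting a predictor the stump does not use leaves its predictions
        # unchanged, so only predictor j receives a non-zero contribution here.
        totals[j] += oob_mse(stump, shuffled, targets) - base
    return [t / n_trees for t in totals]


if __name__ == "__main__":
    rng = random.Random(1)
    X, y = make_data(300, 0.9, [1.0, 0.0, 0.0], rng)  # HA: only x1 has an effect
    print("HA:", [round(v, 3) for v in permutation_vim(X, y, 300, rng)])
    X0, y0 = make_data(300, 0.9, [0.0, 0.0, 0.0], rng)  # H0: no predictor has an effect
    print("H0:", [round(v, 3) for v in permutation_vim(X0, y0, 300, rng)])
```

In this toy setting, the correlated predictor x2 receives a clearly positive importance under HA even though its generating coefficient is zero, because stumps that split on x2 use it as a proxy for x1, while all importances stay near zero under H0. A real forest grows deep trees and evaluates several candidate predictors per split, so the magnitudes here should not be read as reproducing the paper's results.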
