Bias in random forest variable importance measures: illustrations, sources and a solution.

Carolin Strobl, Anne-Laure Boulesteix, Torsten Hothorn, Achim Zeileis

doi:10.1186/1471-2105-8-25

Abstract

Background

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results

Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. Two mechanisms underlie this deficiency: biased variable selection in the individual classification trees used to build the random forest, and effects induced by bootstrap sampling with replacement.

Conclusion

We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. The suggested method can therefore be applied straightforwardly by scientists in bioinformatics research.
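The first mechanism named in the abstract, biased variable selection within a single tree, can be illustrated with a small simulation. The sketch below is not the authors' simulation design or their R code. It is a toy setup under stated assumptions: a binary response, four purely uninformative categorical predictors with 2, 4, 10 and 20 equally likely categories, and a Gini-gain split criterion. The function names (`best_gini_gain`, `selection_counts`) and all parameter values are hypothetical. For a two-class response, sorting categories by their share of positives and scanning the cut points is a standard shortcut for finding the best binary partition.

```python
import random
from collections import Counter


def gini(pos, n):
    """Gini impurity of a node holding `pos` positives among `n` observations."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest Gini gain over binary partitions of the categories of x.

    Assumes y is coded 0/1. Categories are sorted by their share of
    positives, then only the cut points along that ordering are scanned.
    """
    n = len(y)
    total_pos = sum(y)
    counts = Counter(x)
    pos = Counter(xi for xi, yi in zip(x, y) if yi == 1)
    cats = sorted(counts, key=lambda c: pos[c] / counts[c])
    parent = gini(total_pos, n)
    best = 0.0
    left_n = left_pos = 0
    for c in cats[:-1]:
        left_n += counts[c]
        left_pos += pos[c]
        right_n = n - left_n
        right_pos = total_pos - left_pos
        child = (left_n * gini(left_pos, left_n)
                 + right_n * gini(right_pos, right_n)) / n
        best = max(best, parent - child)
    return best


def selection_counts(n_obs=120, n_reps=500, levels=(2, 4, 10, 20), seed=1):
    """Count how often each uninformative predictor wins the root split.

    Every predictor is drawn independently of y, so an unbiased criterion
    would pick each of them about equally often.
    """
    rng = random.Random(seed)
    wins = Counter({k: 0 for k in levels})
    for _ in range(n_reps):
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        gains = {}
        for k in levels:
            x = [rng.randrange(k) for _ in range(n_obs)]
            gains[k] = best_gini_gain(x, y)
        wins[max(gains, key=gains.get)] += 1
    return dict(wins)


if __name__ == "__main__":
    # Keys are the numbers of categories, values the selection counts.
    print(selection_counts())
```

Because none of the predictors carries information about the response, any systematic preference for the predictors with more categories reflects the selection criterion rather than the data. The second mechanism, bootstrap sampling with replacement, is not covered by this sketch.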

Highlights

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease.
We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.
When the proposed alternative random forest implementation, which provides unbiased variable selection in the individual classification trees, is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.

Summary

Introduction

The nonlinear and nonparametric random forest method [3], an approach from machine learning, has recently been proposed for prediction and variable selection in various fields related to bioinformatics and computational biology. It provides variable importance measures for variable selection purposes.

Journal: BMC Bioinformatics
Publication Date: Jan 25, 2007
Citations: 2742
License type: CC BY 2.0

Similar Papers

Empirical characterization of random forest variable importance measures
Kellie J Archer ... Ryan V Kimes
Computational Statistics & Data Analysis | VOL. 52
30 Aug 2007

An experimental study of the intrinsic stability of random forest variable importance measures.
Huazhen Wang ... Fan Yang
BMC Bioinformatics | VOL. 17
03 Feb 2016

An AUC-based permutation variable importance measure for random forests
Silke Janitza ... Anne-Laure Boulesteix
BMC Bioinformatics | VOL. 14
05 Apr 2013

Mining data with random forests: A survey and results of new tests
A Verikas ... M Bacauskiene
Pattern Recognition | VOL. 44
12 Aug 2010