An experimental study of the intrinsic stability of random forest variable importance measures.

Huazhen Wang, Zhiyuan Luo, Fan Yang

doi:10.1186/s12859-016-0900-5

Open Access

https://doi.org/10.1186/s12859-016-0900-5

Abstract

Background: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention paid to traditional stability, that is, stability under data perturbations or parameter variations, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e., bagging, randomization and permutation. To address this influence, in this paper we introduce a new concept, the intrinsic stability of VIMs, defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations or parameter variations. Two widely used VIMs, Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability.

Results: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional, small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (sample size) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both a negative monotonic correlation and a negative linear correlation with the intrinsic stability, while OOB accuracy has a monotonic correlation with intrinsic stability. This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to the parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability.

Conclusion: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs comes not only from data perturbations or parameter variations, but also from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
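The definition of intrinsic stability given in the abstract (self-consistency among feature rankings across repeated runs on unchanged data and parameters) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say which similarity measure is used between rankings, so the mean pairwise Spearman correlation is an assumption here, and `toy_importance` is a hypothetical stand-in for a real MDA or MDG run in which only the random seed changes.

```python
import itertools
import random
from statistics import mean


def ranks(scores):
    """Rank scores so the highest score gets rank 1 (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    rank = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        rank[idx] = position
    return rank


def spearman(rank_a, rank_b):
    """Spearman correlation of two tie-free rankings of equal length."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def intrinsic_stability(importance_fn, n_runs=10):
    """Mean pairwise Spearman correlation over repeated runs.

    importance_fn(seed) returns one importance score per feature.
    Only the seed differs between runs; data and parameters are fixed.
    Pairwise Spearman is an assumed similarity measure.
    """
    rankings = [ranks(importance_fn(seed)) for seed in range(n_runs)]
    pairs = itertools.combinations(rankings, 2)
    return mean(spearman(a, b) for a, b in pairs)


def toy_importance(seed, noise=0.05):
    """Hypothetical VIM run: a fixed signal plus seed-dependent noise."""
    rng = random.Random(seed)
    signal = [0.9, 0.7, 0.5, 0.3, 0.1]
    return [s + rng.gauss(0.0, noise) for s in signal]


stability = intrinsic_stability(toy_importance)
```

In practice `importance_fn` would train a random forest with the given seed and return its MDA or MDG scores. A value of 1 means every run produced the same ranking, and lower values indicate more disagreement between runs.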

Highlights

The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention
Influence of the parameter setting on intrinsic stability: to explore whether the intrinsic stability is affected by the parameter settings of VIMs, the distribution of intrinsic stability across different parameter settings is investigated
To investigate the coupling effect of #feature and #sample across all 19 datasets, we evaluate the relationship between intrinsic stability and a synthetic indicator, #feature/#sample, which can be seen as a measure of how high-dimensional and small-sample a dataset is
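The relationship between the synthetic indicator #feature/#sample and intrinsic stability can be checked with a monotonic (Spearman) and a linear (Pearson) correlation, as in the rough sketch below. The function names are illustrative, and the `toy` values are invented purely to exercise the code; they are not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson (linear) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ascending ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman (monotonic) correlation: Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def feature_sample_ratio(n_features, n_samples):
    """Synthetic indicator #feature/#sample."""
    return n_features / n_samples


# Hypothetical (n_features, n_samples, stability) triples, NOT from the paper.
toy = [(10, 1000, 0.98), (50, 500, 0.95), (2000, 100, 0.80), (7000, 70, 0.60)]
ratios = [feature_sample_ratio(f, s) for f, s, _ in toy]
stability = [st for _, _, st in toy]
rho = spearman(ratios, stability)  # monotonic association
r = pearson(ratios, stability)     # linear association
```

Reporting both coefficients separates a relationship that is merely monotonic from one that is also close to linear, which matches the distinction the text draws between monotonic and linear correlation.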

Summary

Introduction

The stability of Variable Importance Measures (VIMs) based on random forest (RF) has recently received increased attention. Despite the extensive attention paid to traditional stability, that is, stability under data perturbations or parameter variations, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e., bagging, randomization and permutation. To address this influence, in this paper we introduce a new concept, the intrinsic stability of VIMs, defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations or parameter variations. RF provides two widely used VIMs, Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), and both are comprehensively investigated. The feature ranking produced by MDA or MDG serves as a filter to eliminate irrelevant features, and has been applied in a large variety of domains [3, 7, 8, 9, 10, 11].
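The permutation idea behind MDA can be sketched as below: shuffle one feature at a time and record how much accuracy drops. This is a simplified illustration under stated assumptions, not the random forest procedure itself, which computes the decrease on each tree's out-of-bag samples and averages over trees. Here `predict` is any fixed classifier, and all function names are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, seed=0):
    """Accuracy decrease after shuffling each feature column in turn.

    The seed controls the shuffles, so different seeds can yield
    different scores and rankings on the same data and model.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by shuffled values.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(predict, X_perm, y))
    return importances
```

Sorting features by these scores gives the ranking used for filtering. Because the shuffles are random, repeating the call with a different `seed` is one of the sources of run-to-run variation that the text calls intrinsic randomness.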

Journal: BMC Bioinformatics
Publication Date: Feb 3, 2016
Citations: 145
License type: CC BY 4.0
