Diversity Forests: Using Split Sampling to Allow for Complex Split Procedures in Random Forest

Roman Hornung

doi:10.5282/ubm/epub.73377

Abstract

Diversity forests are a class of random forest-type prediction methods that modify the split selection procedure of conventional random forests to allow for complex split procedures. While random forests show strong prediction performance when using conventional univariate, binary splitting, this procedure still has disadvantages; for example, interactions between features are not exploited effectively. Diversity forests select each split as the best among a set of 'nsplits' candidate splits, which are obtained by random selection from repeatedly sampled, specifically structured collections of splits. This makes complex split procedures computationally tractable while avoiding overfitting. This paper focuses on introducing diversity forests and evaluating their performance for univariate, binary splitting. Specific complex split procedures will be the focus of future work. Using a collection of 220 real data sets with binary target variables, diversity forests are compared with conventional random forests and random forests using extremely randomized trees. The results show that randomizing the split selection, as performed by diversity forests, leads to slight improvements in prediction performance and that this performance is quite robust with regard to the specified 'nsplits' value. These results indicate that diversity forests are well suited for realizing complex split procedures in random forests.
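
To make the split selection idea concrete, the following is a minimal sketch for the univariate, binary case with a 0/1 target. It is not the author's implementation. The function names (`gini`, `sample_candidate_splits`, `best_split`), the use of midpoints between neighboring observed values as split points, the Gini criterion, and the cap on sampling attempts are all assumptions made for illustration; the structured sampling used in the paper may differ.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def sample_candidate_splits(X, nsplits, rng):
    """Draw up to nsplits distinct (feature, threshold) pairs at random.

    Assumption: a candidate is formed by picking a random feature and then
    a random midpoint between two neighboring observed values.
    """
    n_features = len(X[0])
    candidates = set()
    # Bounded number of attempts so the loop ends even if few splits exist.
    for _ in range(20 * nsplits):
        if len(candidates) >= nsplits:
            break
        j = rng.randrange(n_features)
        values = sorted({row[j] for row in X})
        if len(values) < 2:
            continue  # constant feature: no split possible
        i = rng.randrange(len(values) - 1)
        candidates.add((j, (values[i] + values[i + 1]) / 2.0))
    return sorted(candidates)


def best_split(X, y, nsplits, rng):
    """Return the sampled split with the lowest weighted Gini impurity."""
    best, best_score = None, float("inf")
    n = len(y)
    for j, t in sample_candidate_splits(X, nsplits, rng):
        left = [y[k] for k in range(n) if X[k][j] <= t]
        right = [y[k] for k in range(n) if X[k][j] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best, best_score = (j, t), score
    return best, best_score


# Toy usage: four observations, two features, binary target.
X = [[0.1, 5.0], [0.2, 3.0], [0.8, 4.0], [0.9, 6.0]]
y = [0, 0, 1, 1]
split, score = best_split(X, y, nsplits=3, rng=random.Random(0))
```

In this sketch, only the sampled candidates are evaluated, so the cost per node is governed by 'nsplits' rather than by the total number of possible splits. This is the property that would matter most once the candidates are complex, for example multivariate, splits.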

Full Text