CloudForest: A Scalable and Efficient Random Forest Implementation for Biological Data.

Ryan Bressler, John E Niederhuber, Brady Bernard, Theo A Knijnenburg, Ilya Shmulevich, Joseph G Vockley, Richard B Kreisberg, Panayiotis V Benos

doi:10.1371/journal.pone.0144820

Open Access

Abstract

Random Forest has become a standard data analysis tool in computational biology. However, extensions to existing implementations are often necessary to handle the complexity of biological datasets and their associated research questions. The growing size of these datasets requires high-performance implementations. We describe CloudForest, a Random Forest package written in Go, which is particularly well suited for large, heterogeneous genetic and biomedical datasets. CloudForest includes several extensions, such as handling of unbalanced classes and missing values. Its flexible design enables users to easily implement additional extensions. CloudForest achieves fast running times through effective use of the CPU cache, optimization for different classes of features, and efficient multi-threading. CloudForest is available at https://github.com/ilyalab/CloudForest.

Highlights

Random Forest (RF) [1] has become a widely used method for classification and regression analysis of biological data.
We developed CloudForest, a well-documented RF package with a flexible design that enables straightforward implementation of extensions, many of which are already present in the current version.
CloudForest retains consistently better prediction performance than scikit-learn, especially when many values are missing (Fig 2a). We found this pattern across most datasets from The Cancer Genome Atlas (TCGA), though it was not always as pronounced as for colorectal cancer (CRC) (Figure D in S1 File).

Summary

Introduction

Random Forest (RF) [1] has become a widely used method for classification and regression analysis of biological data. It often achieves good prediction performance on datasets that are characterized by a large number of features and a relatively small number of samples [2]. Specific extensions and adaptations have been developed to handle the intricacies of certain biological datasets and associated research questions [4, 5]. These address, among other things, unbalanced classes, heterogeneous feature types, alternative notions of feature importance, methods for feature selection, and robustness against noisy and missing data. The huge number of features in biological datasets that are derived from high-throughput genome-wide measurement technologies, such as microarrays and sequencing platforms, necessitates fast RF implementations.

Methods

Results

Conclusion