Optimizing Random Forests: Spark Implementations of Random Genetic Forests

Sikha Bagui, Timothy Bennett

doi:10.54646/bije.009

Abstract

The Random Forest (RF) algorithm, originally proposed by Breiman [7], is a widely used machine learning algorithm valued for its fast learning speed and high classification accuracy. Despite this widespread use, the mechanisms at work in Breiman’s RF are not yet fully understood, and research continues on several aspects of optimizing the algorithm, especially in the big data environment. This work builds new ensembles that use genetic algorithms to optimize the random portions of the RF algorithm, yielding Random Genetic Forests (RGF), Negatively Correlated RGF (NC-RGF), and Preemptive RGF (PFS-RGF). These ensembles are compared with Breiman’s classic RF algorithm in Hadoop’s big data framework using Spark on a large, high-dimensional network intrusion dataset, UNSW-NB15.

Full Text