Abstract

In this paper, we explain in detail how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest’s learning process is based on the principle of recursive partitioning, and although recursion per se is not allowed in ECL (HPCC’s programming language), we were able to implement the recursive partition algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and thoroughly describe all the modifications required to overcome the bottleneck within the iterative split/partition process, namely the gathering of the selected independent variables used for each node’s best-split analysis. Essentially, we describe how our initial Random Forest implementation was optimized into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform’s Big Data processing and analytics capabilities, we enhance the data gathering method from an inefficient Pass them All and Filter approach to an effective, completely parallelized Fetching on Demand approach. Finally, based on the results of our learning-process runtime comparison between these two approaches, we confirm the speed-up of our optimized Random Forest implementation.
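To illustrate the core idea, the sketch below shows, in simplified ECL, how a level-by-level split/partition can be expressed with LOOP instead of recursion. The record layout, the mean-based split rule, and all names (NodeInst, SplitLevel, initialAssignments, maxDepth) are illustrative assumptions for this sketch only and do not reproduce the ECL-ML Library's actual Random Forest code.

    // Hypothetical record layout: one row per (node, training instance) pair,
    // carrying the value of one selected independent variable.
    NodeInst := RECORD
      UNSIGNED node_id;   // node currently holding the instance
      UNSIGNED level;     // depth of that node in the tree
      UNSIGNED inst_id;   // training-instance identifier
      REAL     value;     // value of the selected independent variable
      UNSIGNED dep;       // class label
    END;

    // Toy starting point: every instance assigned to the root node (node_id = 1).
    initialAssignments := DATASET([{1, 0, 1, 2.5, 0}, {1, 0, 2, 7.1, 1},
                                   {1, 0, 3, 4.3, 0}, {1, 0, 4, 9.9, 1}], NodeInst);

    // One iteration = one tree level: split every current node and reassign its
    // instances to child nodes (placeholder rule: compare against the node mean).
    SplitLevel(DATASET(NodeInst) nodes, UNSIGNED depth) := FUNCTION
      means := TABLE(nodes, {node_id, REAL m := AVE(GROUP, value)}, node_id);
      RETURN JOIN(nodes, means, LEFT.node_id = RIGHT.node_id,
                  TRANSFORM(NodeInst,
                            SELF.node_id := LEFT.node_id * 2 + IF(LEFT.value < RIGHT.m, 0, 1);
                            SELF.level   := depth;
                            SELF         := LEFT));
    END;

    maxDepth := 5;
    // LOOP drives the iterative split/partition process that replaces recursion.
    partitioned := LOOP(initialAssignments, maxDepth, SplitLevel(ROWS(LEFT), COUNTER));
    OUTPUT(partitioned);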

Highlights

  • The High Performance Computing Cluster (HPCC) Systems Platform [1, 2] from LexisNexis Risk Solutions is an open source, parallel-processing computing platform designed for Big Data processing and analytics. HPCC is a scalable system based on hardware clusters of commodity servers.

  • In the “Methods” section, we introduce some basics about the HPCC Platform and its programming language (ECL), and we present the background of the supervised learning implementations in HPCC’s ECL-Machine Learning (ML) Library in the “Enterprise Control Language (ECL)” and “Supervised learning in HPCC platform” sections.

  • Some of them come from improvements to other modules, such as Decision Trees and sampling, while others are designed to overcome Random Forest (RF) issues, such as fetching the Independent data from Sampled Training Data located outside the loopbody function (see the sketch after these highlights).
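As a rough, hypothetical sketch of the Fetching on Demand idea mentioned above: the sampled independent data stays outside the loopbody, distributed by instance id, and each iteration pulls in only the variables selected for the current split analysis through a local JOIN, instead of passing the entire dataset through the loop and filtering it. All record layouts and names here (IndepRec, AssignRec, FetchedRec, FetchSelected) are assumptions for illustration, not the library's actual definitions.

    // Hypothetical layouts for the sketch.
    IndepRec := RECORD
      UNSIGNED inst_id;   // training-instance identifier
      UNSIGNED feat_id;   // independent variable (feature) identifier
      REAL     value;
    END;

    AssignRec := RECORD
      UNSIGNED inst_id;
      UNSIGNED node_id;   // node currently holding the instance
    END;

    FetchedRec := RECORD
      AssignRec;          // inherits inst_id and node_id
      UNSIGNED feat_id;
      REAL     value;
    END;

    // Sampled training data kept OUTSIDE the loopbody and distributed by
    // instance id, so each per-iteration fetch can be a purely local join.
    indepData := DISTRIBUTE(DATASET([{1, 1, 2.5}, {1, 2, 0.7},
                                     {2, 1, 7.1}, {2, 2, 1.3}], IndepRec),
                            HASH32(inst_id));

    // Fetching on Demand: only the (instance, node) assignments travel through
    // the loop; the selected variables are fetched here for the split analysis.
    FetchSelected(DATASET(AssignRec) assigns, SET OF UNSIGNED selected) := FUNCTION
      needed := indepData(feat_id IN selected);       // only the sampled features
      RETURN JOIN(DISTRIBUTE(assigns, HASH32(inst_id)), needed,
                  LEFT.inst_id = RIGHT.inst_id,
                  TRANSFORM(FetchedRec,
                            SELF.feat_id := RIGHT.feat_id;
                            SELF.value   := RIGHT.value;
                            SELF         := LEFT),
                  LOCAL);                             // no global data movement
    END;

    // Usage: fetch only feature 1 for the two instances currently at the root.
    assigns0 := DATASET([{1, 1}, {2, 1}], AssignRec);
    OUTPUT(FetchSelected(assigns0, [1]));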



Introduction

The HPCC Systems Platform [1, 2] from LexisNexis Risk Solutions is an open source, parallel-processing computing platform designed for Big Data processing and analytics. HPCC is a scalable system based on hardware clusters of commodity servers. The Enterprise Control Language (ECL) [1, 3], part of the HPCC platform, is a data-centric, declarative, and non-procedural programming language designed for Big Data projects using the LexisNexis HPCC platform. HPCC’s machine learning abilities are implemented within the ECL-ML plug-in module, known as the ECL-ML Library [4], which extends the capabilities of the base HPCC platform. The ECL-ML Library is an open source project [4] created to manage supervised and unsupervised learning, document and text analysis, statistics and probabilities, and general inductive inference-related problems in HPCC.
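As a small, hypothetical taste of that declarative, data-centric style (the record layout and values below are made up for illustration): ECL definitions describe what to compute over a dataset, and the platform decides how to execute it in parallel across the cluster.

    IMPORT Std;

    PersonRec := RECORD
      UNSIGNED id;
      STRING25 name;
      UNSIGNED age;
    END;

    people := DATASET([{1, 'ana', 34}, {2, 'bo', 27}, {3, 'cy', 45}], PersonRec);

    adults := people(age >= 30);                    // declarative filter, no loops
    upperNames := PROJECT(adults,
                          TRANSFORM(PersonRec,
                                    SELF.name := Std.Str.ToUpperCase(LEFT.name);
                                    SELF := LEFT));
    avgAge := AVE(people, age);                     // aggregate over the dataset

    OUTPUT(upperNames);
    OUTPUT(avgAge);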

