Abstract

In this paper, we explain in detail how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest’s learning process is based on the principle of recursive partitioning, and although recursion per se is not allowed in ECL (HPCC’s programming language), we were able to implement the recursive partition algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and thoroughly describe all the modifications required to overcome the bottleneck within the iterative split/partition process, namely the gathering of the selected independent variables used for each node’s best-split analysis. Essentially, we describe how our initial Random Forest implementation was optimized into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform’s Big Data processing and analytics capabilities, we enhance the data gathering method from an inefficient Pass them All and Filter approach to an effective, completely parallelized Fetching on Demand approach. Finally, based on the results of our learning-process runtime comparison between these two approaches, we confirm the speed-up of our optimized Random Forest implementation.
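To illustrate the core idea, the sketch below shows, in simplified ECL, how a level-by-level split/partition can be expressed with LOOP instead of recursion. The record layout, the mean-based split rule, and all names (NodeInst, SplitLevel, initialAssignments, maxDepth) are illustrative assumptions for this sketch only and do not reproduce the ECL-ML Library's actual Random Forest code.

    // Hypothetical record layout: one row per (node, training instance) pair,
    // carrying the value of one selected independent variable.
    NodeInst := RECORD
      UNSIGNED node_id;   // node currently holding the instance
      UNSIGNED level;     // depth of that node in the tree
      UNSIGNED inst_id;   // training-instance identifier
      REAL     value;     // value of the selected independent variable
      UNSIGNED dep;       // class label
    END;

    // Toy starting point: every instance assigned to the root node (node_id = 1).
    initialAssignments := DATASET([{1, 0, 1, 2.5, 0}, {1, 0, 2, 7.1, 1},
                                   {1, 0, 3, 4.3, 0}, {1, 0, 4, 9.9, 1}], NodeInst);

    // One iteration = one tree level: split every current node and reassign its
    // instances to child nodes (placeholder rule: compare against the node mean).
    SplitLevel(DATASET(NodeInst) nodes, UNSIGNED depth) := FUNCTION
      means := TABLE(nodes, {node_id, REAL m := AVE(GROUP, value)}, node_id);
      RETURN JOIN(nodes, means, LEFT.node_id = RIGHT.node_id,
                  TRANSFORM(NodeInst,
                            SELF.node_id := LEFT.node_id * 2 + IF(LEFT.value < RIGHT.m, 0, 1);
                            SELF.level   := depth;
                            SELF         := LEFT));
    END;

    maxDepth := 5;
    // LOOP drives the iterative split/partition process that replaces recursion.
    partitioned := LOOP(initialAssignments, maxDepth, SplitLevel(ROWS(LEFT), COUNTER));
    OUTPUT(partitioned);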

Highlights

  • The High Performance Computing Cluster (HPCC) Systems Platform [1, 2] from LexisNexis Risk Solutions is an open source, parallel-processing computing platform designed for Big Data processing and analytics. HPCC is a scalable system based on hardware clusters of commodity servers.

  • In the “Methods” section, we introduce some basics about the HPCC Platform and its programming language (ECL), and we present the background of the supervised learning implementations in HPCC’s ECL-Machine Learning (ML) Library in the “Enterprise Control Language (ECL)” and “Supervised learning in HPCC platform” sections.

  • Some of them come from improvements to other modules, such as Decision Trees and sampling, while others are designed to overcome Random Forest (RF) issues, such as fetching the Independent data from Sampled Training Data located outside the loopbody function (see the sketch after these highlights).
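As a rough, hypothetical sketch of the Fetching on Demand idea mentioned above: the sampled independent data stays outside the loopbody, distributed by instance id, and each iteration pulls in only the variables selected for the current split analysis through a local JOIN, instead of passing the entire dataset through the loop and filtering it. All record layouts and names here (IndepRec, AssignRec, FetchedRec, FetchSelected) are assumptions for illustration, not the library's actual definitions.

    // Hypothetical layouts for the sketch.
    IndepRec := RECORD
      UNSIGNED inst_id;   // training-instance identifier
      UNSIGNED feat_id;   // independent variable (feature) identifier
      REAL     value;
    END;

    AssignRec := RECORD
      UNSIGNED inst_id;
      UNSIGNED node_id;   // node currently holding the instance
    END;

    FetchedRec := RECORD
      AssignRec;          // inherits inst_id and node_id
      UNSIGNED feat_id;
      REAL     value;
    END;

    // Sampled training data kept OUTSIDE the loopbody and distributed by
    // instance id, so each per-iteration fetch can be a purely local join.
    indepData := DISTRIBUTE(DATASET([{1, 1, 2.5}, {1, 2, 0.7},
                                     {2, 1, 7.1}, {2, 2, 1.3}], IndepRec),
                            HASH32(inst_id));

    // Fetching on Demand: only the (instance, node) assignments travel through
    // the loop; the selected variables are fetched here for the split analysis.
    FetchSelected(DATASET(AssignRec) assigns, SET OF UNSIGNED selected) := FUNCTION
      needed := indepData(feat_id IN selected);       // only the sampled features
      RETURN JOIN(DISTRIBUTE(assigns, HASH32(inst_id)), needed,
                  LEFT.inst_id = RIGHT.inst_id,
                  TRANSFORM(FetchedRec,
                            SELF.feat_id := RIGHT.feat_id;
                            SELF.value   := RIGHT.value;
                            SELF         := LEFT),
                  LOCAL);                             // no global data movement
    END;

    // Usage: fetch only feature 1 for the two instances currently at the root.
    assigns0 := DATASET([{1, 1}, {2, 1}], AssignRec);
    OUTPUT(FetchSelected(assigns0, [1]));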



Introduction

The HPCC Systems Platform [1, 2] from LexisNexis Risk Solutions is an open source, parallel-processing computing platform designed for Big Data processing and analytics. HPCC is a scalable system based on hardware clusters of commodity servers. The Enterprise Control Language (ECL) [1, 3], part of the HPCC platform, is a data-centric, declarative, and non-procedural programming language designed for Big Data projects using the LexisNexis HPCC platform. HPCC’s machine learning abilities are implemented within the ECL-ML plug-in module, known as the ECL-ML Library [4], which extends the capabilities of the base HPCC platform. The ECL-ML Library is an open source project [4] created to manage supervised and unsupervised learning, document and text analysis, statistics and probabilities, and general inductive inference-related problems in HPCC.
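As a small, hypothetical taste of that declarative, data-centric style (the record layout and values below are made up for illustration): ECL definitions describe what to compute over a dataset, and the platform decides how to execute it in parallel across the cluster.

    IMPORT Std;

    PersonRec := RECORD
      UNSIGNED id;
      STRING25 name;
      UNSIGNED age;
    END;

    people := DATASET([{1, 'ana', 34}, {2, 'bo', 27}, {3, 'cy', 45}], PersonRec);

    adults := people(age >= 30);                    // declarative filter, no loops
    upperNames := PROJECT(adults,
                          TRANSFORM(PersonRec,
                                    SELF.name := Std.Str.ToUpperCase(LEFT.name);
                                    SELF := LEFT));
    avgAge := AVE(people, age);                     // aggregate over the dataset

    OUTPUT(upperNames);
    OUTPUT(avgAge);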

