A Fast Parallel Random Forest Algorithm Based on Spark

Linzi Yin, Xuemei Xu, Zhaohui Jiang, Ken Chen

doi:10.3390/app13106121

Abstract

To improve computational efficiency and classification accuracy in big data settings, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy and thereby raise classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, building on the Apache Spark computing framework, a forest sampling index (FSI) table is defined to speed up the parallel training of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm constructs random forests more efficiently while maintaining classification accuracy, and that it outperforms Spark-MLRF in both performance and scalability.
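The binning idea in the abstract can be illustrated with a minimal sketch. The abstract does not define the paper's modified Gini coefficient, its exact binning rule, or the FSI table, so none of those is reproduced here. The code below assumes the standard Gini impurity and a simple sort-based equal-frequency rule instead, and the function names (`gini`, `equal_frequency_candidates`, `best_split`) are hypothetical, chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity, 1 - sum(p_k^2). Not the paper's modified coefficient."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def equal_frequency_candidates(values, n_bins):
    """Candidate thresholds at the boundaries of roughly equal-sized bins.

    Assumption: boundaries are midpoints between neighbouring sorted values.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for b in range(1, n_bins):
        idx = (b * n) // n_bins
        if idx <= 0 or idx >= n:
            continue
        threshold = (ordered[idx - 1] + ordered[idx]) / 2.0
        # Skip repeated thresholds caused by ties or very small samples.
        if not cuts or threshold > cuts[-1]:
            cuts.append(threshold)
    return cuts


def best_split(values, labels, n_bins):
    """Return (threshold, weighted Gini) for the best candidate, or None."""
    n = len(values)
    best = None
    for t in equal_frequency_candidates(values, n_bins):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

With eight distinct feature values and `n_bins=4`, only three thresholds are scored instead of the seven midpoints an exhaustive search would try, which is the kind of saving the abstract refers to when it mentions fewer candidate split points and fewer Gini calculations.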

Journal: Applied Sciences
Publication Date: May 17, 2023
License type: CC BY 4.0
