Abstract

As society has developed, increasing amounts of data have been generated by various industries. The random forest algorithm, as a classification algorithm, is widely used because of its superior performance. However, the random forest algorithm uses simple random sampling to generate feature subspaces, which cannot distinguish redundant features and thereby degrades its classification accuracy; in addition, computation on a single machine is inefficient for large data sets. In response to these problems, the present paper conducts related optimization research on Spark. The improved random forest algorithm extracts features according to their calculated importance to form the feature subspace. When generating the random forest model, it selects decision trees based on the similarity and classification accuracy of the different decision trees. Experimental results reveal that, compared with the original random forest algorithm, the improved algorithm proposed in the present paper achieves higher classification accuracy and can classify data effectively.
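To make the procedure concrete, the following is a minimal single-machine sketch of the approach described above: features are ranked by an importance estimate, each tree's feature subspace is sampled with a bias toward important features, and only trees whose validation accuracy passes a threshold are retained and weighted in the vote. The mutual-information importance measure, the accuracy threshold, and the weighted vote are illustrative assumptions (the similarity criterion and the Spark parallelization used in the paper are omitted); the sketch assumes NumPy arrays as input.

```python
# Minimal single-machine sketch of the W-RF idea described in the abstract.
# Function names and the importance/selection details are illustrative
# assumptions, not the authors' exact method.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

def build_weighted_forest(X, y, n_trees=50, subspace_size=None, acc_threshold=0.6):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    n_features = X.shape[1]
    subspace_size = subspace_size or max(1, int(np.sqrt(n_features)))

    # Step 1: estimate feature importance (mutual information as a proxy).
    importance = mutual_info_classif(X_train, y_train, random_state=0)
    prob = (importance + 1e-6) / (importance + 1e-6).sum()  # strictly positive sampling weights

    forest = []
    rng = np.random.default_rng(0)
    for _ in range(n_trees):
        # Step 2: bootstrap sample + importance-weighted feature subspace.
        rows = rng.integers(0, len(X_train), len(X_train))
        cols = rng.choice(n_features, size=subspace_size, replace=False, p=prob)
        tree = DecisionTreeClassifier().fit(X_train[rows][:, cols], y_train[rows])

        # Step 3: keep only trees whose held-out accuracy passes the threshold,
        # and remember that accuracy as the tree's voting weight.
        acc = tree.score(X_val[:, cols], y_val)
        if acc >= acc_threshold:
            forest.append((tree, cols, acc))
    return forest

def predict(forest, X):
    # Weighted majority vote over the retained trees.
    votes = {}
    for tree, cols, weight in forest:
        for i, label in enumerate(tree.predict(X[:, cols])):
            votes.setdefault(i, {}).setdefault(label, 0.0)
            votes[i][label] += weight
    return np.array([max(v, key=v.get) for _, v in sorted(votes.items())])
```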

Highlights

  • The rapid development of the Internet has led to the continuous generation of different types of data in various industries

  • The advantages of the random forest algorithm stem largely from two sources of randomness: the training subset generated for each decision tree during training and the feature subspace assigned to each decision tree

  • In the present study, to ensure both the classification accuracy and the stability of the random forest algorithm, the feature importance was calculated, the features were distinguished accordingly, and the feature subspace was generated based on the feature importance (see the sketch after this list)
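As a small illustration of the last highlight, the snippet below estimates per-feature importance and separates the feature set into important and less-important (potentially redundant) groups before a subspace is drawn; mutual information and the median threshold are illustrative assumptions rather than the paper's exact measure.

```python
# Tiny illustration: distinguish features by importance before forming a subspace.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
importance = mutual_info_classif(X, y, random_state=0)

threshold = np.median(importance)
important = np.where(importance > threshold)[0]
redundant = np.where(importance <= threshold)[0]

# A feature subspace can then be drawn mostly from the important group,
# e.g. three important features plus one from the remainder.
rng = np.random.default_rng(0)
subspace = np.concatenate([rng.choice(important, 3, replace=False),
                           rng.choice(redundant, 1, replace=False)])
print("important features:", important)
print("chosen subspace:", subspace)
```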

Introduction

The rapid development of the Internet has led to the continuous generation of different types of data in various industries. As a classification algorithm in data mining, the random forest algorithm [1] is widely used in credit evaluation [2,3,4], image classification [5,6], and text classification [7], among others. This can be attributed to its ability to avoid over-fitting without being sensitive to noise. A random forest algorithm based on rough set theory has been proposed [13]; this algorithm calculates the importance of attributes based on a discernibility matrix and selects the first several attributes with the highest importance to form the feature subspace. Related research has also been conducted on random forests for high-dimensional data, and a hierarchical subspace method was proposed to address the inability of random forest algorithms to distinguish feature correlation when processing high-dimensional data [14]. Building on these ideas, the present paper conducts its optimization research on a parallel implementation of the random forest algorithm based on Spark.
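For context on the Spark side, the sketch below trains the standard MLlib random forest in parallel on a cluster; it is the unmodified baseline rather than the improved algorithm developed in this paper, and the data path and column names are placeholders.

```python
# Hedged sketch: parallel random forest training with Spark MLlib (baseline,
# not the paper's improved W-RF). "data.csv" and the "label" column are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-baseline").getOrCreate()

# Placeholder dataset: a CSV with numeric feature columns and a "label" column.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
feature_cols = [c for c in df.columns if c != "label"]
df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

train, test = df.randomSplit([0.7, 0.3], seed=42)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)              # trees are trained in parallel across the cluster
predictions = model.transform(test)

acc = MulticlassClassificationEvaluator(labelCol="label",
                                        predictionCol="prediction",
                                        metricName="accuracy").evaluate(predictions)
print("test accuracy:", acc)
spark.stop()
```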

Related Work
Analysis of the Random Forest Algorithm
Improved Random Forest Algorithm W-RF
Feature Subspace Generation Strategy Based on Feature Importance
Random Forest of Weights
Experimental
Conclusion