Abstract

As society has developed, increasing amounts of data have been generated by various industries. The random forest algorithm, as a classification algorithm, is widely used because of its superior performance. However, the random forest algorithm uses simple random sampling to generate feature subspaces, which cannot distinguish redundant features and thereby degrades its classification accuracy; in addition, computation on a single machine is inefficient for large data sets. In response to these problems, the present paper conducts related optimization research on Spark. The improved random forest algorithm extracts features according to their calculated importance to form the feature subspace. When generating the random forest model, it selects decision trees based on the similarity and classification accuracy of the different decision trees. Experimental results reveal that, compared with the original random forest algorithm, the improved algorithm proposed in the present paper achieves higher classification accuracy and can classify data effectively.
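To make the procedure concrete, the following is a minimal single-machine sketch of the approach described above: features are ranked by an importance estimate, each tree's feature subspace is sampled with a bias toward important features, and only trees whose validation accuracy passes a threshold are retained and weighted in the vote. The mutual-information importance measure, the accuracy threshold, and the weighted vote are illustrative assumptions (the similarity criterion and the Spark parallelization used in the paper are omitted); the sketch assumes NumPy arrays as input.

```python
# Minimal single-machine sketch of the W-RF idea described in the abstract.
# Function names and the importance/selection details are illustrative
# assumptions, not the authors' exact method.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

def build_weighted_forest(X, y, n_trees=50, subspace_size=None, acc_threshold=0.6):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    n_features = X.shape[1]
    subspace_size = subspace_size or max(1, int(np.sqrt(n_features)))

    # Step 1: estimate feature importance (mutual information as a proxy).
    importance = mutual_info_classif(X_train, y_train, random_state=0)
    prob = (importance + 1e-6) / (importance + 1e-6).sum()  # strictly positive sampling weights

    forest = []
    rng = np.random.default_rng(0)
    for _ in range(n_trees):
        # Step 2: bootstrap sample + importance-weighted feature subspace.
        rows = rng.integers(0, len(X_train), len(X_train))
        cols = rng.choice(n_features, size=subspace_size, replace=False, p=prob)
        tree = DecisionTreeClassifier().fit(X_train[rows][:, cols], y_train[rows])

        # Step 3: keep only trees whose held-out accuracy passes the threshold,
        # and remember that accuracy as the tree's voting weight.
        acc = tree.score(X_val[:, cols], y_val)
        if acc >= acc_threshold:
            forest.append((tree, cols, acc))
    return forest

def predict(forest, X):
    # Weighted majority vote over the retained trees.
    votes = {}
    for tree, cols, weight in forest:
        for i, label in enumerate(tree.predict(X[:, cols])):
            votes.setdefault(i, {}).setdefault(label, 0.0)
            votes[i][label] += weight
    return np.array([max(v, key=v.get) for _, v in sorted(votes.items())])
```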

Highlights

  • The rapid development of the Internet has led to the continuous generation of different types of data in various industries

  • The advantages of the random forest algorithm stem largely from two sources of randomness: the training subset generated for each decision tree during training and the feature subspace assigned to each decision tree

  • In the present study, to ensure both the classification accuracy and the stability of the random forest algorithm, the feature importance was calculated, the features were distinguished accordingly, and the feature subspace was generated based on the feature importance (see the sketch after this list)
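As a small illustration of the last highlight, the snippet below estimates per-feature importance and separates the feature set into important and less-important (potentially redundant) groups before a subspace is drawn; mutual information and the median threshold are illustrative assumptions rather than the paper's exact measure.

```python
# Tiny illustration: distinguish features by importance before forming a subspace.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
importance = mutual_info_classif(X, y, random_state=0)

threshold = np.median(importance)
important = np.where(importance > threshold)[0]
redundant = np.where(importance <= threshold)[0]

# A feature subspace can then be drawn mostly from the important group,
# e.g. three important features plus one from the remainder.
rng = np.random.default_rng(0)
subspace = np.concatenate([rng.choice(important, 3, replace=False),
                           rng.choice(redundant, 1, replace=False)])
print("important features:", important)
print("chosen subspace:", subspace)
```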

Introduction

The rapid development of the Internet has led to the continuous generation of different types of data in various industries. As a classification algorithm in data mining, the random forest algorithm [1] is widely used in credit evaluation [2,3,4], image classification [5,6], and text classification [7], among others. This can be attributed to its ability to avoid over-fitting without being sensitive to noise. A random forest algorithm based on rough set theory has been proposed [13]; this algorithm calculates the importance of attributes based on a discernibility matrix and selects the first several attributes with the highest importance to form the feature subspace. Related research has also been conducted on random forests for high-dimensional data, and a hierarchical subspace method was proposed to address the inability of random forest algorithms to distinguish feature correlation when processing high-dimensional data [14]. Building on these ideas, the present paper conducts its optimization research on a parallel implementation of the random forest algorithm based on Spark.
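For context on the Spark side, the sketch below trains the standard MLlib random forest in parallel on a cluster; it is the unmodified baseline rather than the improved algorithm developed in this paper, and the data path and column names are placeholders.

```python
# Hedged sketch: parallel random forest training with Spark MLlib (baseline,
# not the paper's improved W-RF). "data.csv" and the "label" column are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-baseline").getOrCreate()

# Placeholder dataset: a CSV with numeric feature columns and a "label" column.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
feature_cols = [c for c in df.columns if c != "label"]
df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

train, test = df.randomSplit([0.7, 0.3], seed=42)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)              # trees are trained in parallel across the cluster
predictions = model.transform(test)

acc = MulticlassClassificationEvaluator(labelCol="label",
                                        predictionCol="prediction",
                                        metricName="accuracy").evaluate(predictions)
print("test accuracy:", acc)
spark.stop()
```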

Related Work
Analysis of the Random Forest Algorithm
Improved Random Forest Algorithm W-RF
Feature Subspace Generation Strategy Based on Feature Importance
Random Forest of Weights
Experimental
Conclusion