Research on the Classification of High Dimensional Imbalanced Data based on the Optimization of Random Forest Algorithm

Ma Xiaojuan

doi:10.1145/3297730.3297747


Publication Date: Aug 25, 2018


Abstract


A random forest is a forest built in a random way: it consists of many decision trees, and the individual trees are not correlated with one another. Each decision tree is built from a random sample drawn with replacement, and the forest then classifies and predicts by voting. The algorithm overcomes the performance bottleneck of a single classifier, so it is widely used in many fields. It still has room for improvement, however. Because the random forest algorithm runs inefficiently on imbalanced data sets, this paper puts forward a new approach to the imbalance problem. At the same time, because the amount of computation grows rapidly, the paper considers how to improve prediction speed and shorten running time, based on the characteristics of the random forest construction process. Drawing on the domestic and international literature, this paper studies the optimization of random forest from two main aspects. The random forest algorithm is an ensemble learning method in the field of machine learning: it integrates the classification results of multiple decision trees to form a global classifier. Compared with other classification algorithms, it has many advantages. In classification performance, it offers high classification accuracy, a small generalization error, and the ability to handle high-dimensional data; in training, the learning algorithm is fast and easy to parallelize. Because of these two advantages, the random forest algorithm has been widely used and has become one of the preferred methods for classification problems.
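The construction described above (sampling with replacement for each tree, then voting) can be illustrated with a minimal sketch. This is not the paper's implementation: the one-split "stump" stands in for a full decision tree, picking a single random feature per stump is a simplifying assumption, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, n_features, rng):
    """Fit a one-split 'tree' on one randomly chosen feature.

    Assumption for illustration: the split threshold is the mean of the
    chosen feature, and each side predicts its majority label.
    """
    feature = rng.randrange(n_features)
    threshold = sum(x[feature] for x, _ in rows) / len(rows)
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    left = Counter(y for x, y in rows if x[feature] <= threshold)
    right = Counter(y for x, y in rows if x[feature] > threshold)
    # Fall back to the overall majority label when one side is empty.
    left_label = left.most_common(1)[0][0] if left else overall
    right_label = right.most_common(1)[0][0] if right else overall
    return feature, threshold, left_label, right_label


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def fit_forest(rows, n_trees, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    return [
        fit_stump(bootstrap_sample(rows, rng), n_features, rng)
        for _ in range(n_trees)
    ]


def predict_forest(forest, x):
    """Classify x by majority vote over all stumps."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]
```

Each row is assumed to be a `(features, label)` pair, for example `((0.1, 0.3), 0)`. Because every stump sees a different bootstrap sample and a different feature, the stumps disagree on some inputs, and the vote averages out their individual errors.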
However, when the class distribution of the data is imbalanced, that is, when one class has far fewer samples than the others, the random forest algorithm becomes ineffective and suffers from a series of problems, such as a larger generalization error. So far, there has been little research on imbalanced data for random forest classification, and no direct and effective method exists. Some work simply combines random forest with general techniques for imbalanced data, such as sampling or cost-sensitive methods. Improving the classification of imbalanced data at the level of the random forest algorithm itself is therefore a significant research problem. On this basis, this paper analyzes the key steps that affect random forest classification performance and designs a solution for imbalanced data. By studying classification methods for imbalanced data together with the random forest algorithm, we propose an improved random forest algorithm for imbalanced data classification. The improvements address two aspects of random forest: subspace selection and model integration. The influence of balanced sampling on the algorithm is also examined through experiments. Finally, the improved algorithm is validated on public imbalanced data sets: compared with the original random forest algorithm, it shows clear improvement on most indicators (cross-validation accuracy, AUC, Kappa coefficient, and F1-measure). This demonstrates the importance of subspace selection and model optimization for the random forest algorithm. The research has academic significance and practical value for guiding the classification of imbalanced data, and can be applied to spam detection, anomaly detection, medical diagnosis, DNA sequence identification, and similar fields.
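To show why the abstract reports AUC, the Kappa coefficient, and F1-measure alongside accuracy, the sketch below computes these indicators for binary labels using their standard definitions. It is an illustration only, not the paper's evaluation code; the function names and the choice of `1` as the minority (positive) class are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1_measure(tp, fp, fn, tn):
    """Harmonic mean of precision and recall on the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Agreement between prediction and truth, corrected for chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one sample per class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

On a data set with 95 negative and 5 positive samples, a classifier that always predicts the majority class reaches an accuracy of 0.95, yet its F1-measure and Kappa coefficient are both 0, because it never finds a minority sample. This is why accuracy alone is a poor indicator on imbalanced data.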
