Abstract

Under the MapReduce framework, the random forest algorithm suffers from too many redundant and irrelevant features, low information content in the training features, and low parallelization efficiency when applied to multihoming big data network problems. To address this, a parallel random forest algorithm based on information theory and norms (PRFITN) is proposed. The technique first builds a hybrid dimension-reduction approach (DRIGFN) based on information gain and the Frobenius norm, effectively reducing the number of redundant and irrelevant features and yielding a dimensionality-reduced dataset. Next, a feature-grouping strategy based on information theory (FGSIT) is introduced: features are grouped under the FGSIT strategy, and stratified sampling is employed to guarantee the information content of the training features used to build the decision trees in the random forest. Finally, because datasets supplied as key/value pairs commonly require aggregating statistics across all objects sharing the same key, a key-value pair redistribution strategy (RSKP) is applied in the Reduce stage to obtain global classification results, achieving a fast and even distribution of key-value pairs and improving the cluster’s parallel efficiency. Experimental results show that the approach delivers a superior classification effect in multihoming big data networks, particularly for datasets with many features. Feature selection and feature extraction can also be used together: in addition to minimizing overfitting and redundancy, lowering dimensionality improves human interpretability and reduces computing costs through model simplicity.
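The information-gain ranking at the heart of the DRIGFN dimension-reduction step can be sketched as follows. This is a minimal single-machine illustration, not the paper's implementation: the function names and the top-k selection are assumptions for the example, and the Frobenius-norm component of the hybrid approach is omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_col, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_col):
        subset = [y for x, y in zip(feature_col, labels) if x == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

def select_top_k(rows, labels, k):
    """Rank features by information gain and keep the indices of the k best,
    discarding the redundant / irrelevant (low-gain) features."""
    n_features = len(rows[0])
    gains = [(information_gain([row[j] for row in rows], labels), j)
             for j in range(n_features)]
    gains.sort(reverse=True)
    return [j for _, j in gains[:k]]
```

For example, on a toy dataset where feature 0 perfectly predicts the label and feature 1 is constant, `select_top_k([[1, 0], [1, 0], [0, 0], [0, 0]], [1, 1, 0, 0], 1)` keeps only feature 0. In the full algorithm this ranking would run per split of the distributed dataset, with the surviving feature indices merged in the Reduce stage.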
