Abstract

Background: Random forest algorithms typically use simple random sampling of observations when building their decision trees. This random selection allows noisy, outlying, and non-informative data points to enter the construction of the trees, which leads to poor ensemble classification decisions. This paper aims to optimize the sample selection through probability-proportional-to-size (weighted) sampling, in which noisy, outlying, and non-informative data points are down-weighted to improve the classification accuracy of the model. Methods: The weight of each data point is determined from two aspects: the point's influence on the model, found through the leave-one-out method using a single classification tree, and its deviance residual, measured with a logistic regression model; these two quantities are combined into the final weight. Results: The proposed Finest Random Forest (FRF) performs consistently better than the conventional Random Forest (RF) in terms of classification accuracy. Conclusion: Classification accuracy is improved when the random forest is combined with probability-proportional-to-size (weighted) sampling for noisy data with a linear decision boundary.

Keywords: Classification Accuracy, Decision Trees, Noisy Data, Outlier, Random Forest, Weighted Sampling
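The abstract does not give the exact formula for combining the leave-one-out influence and the deviance residual into a sampling weight, so the following is only a minimal sketch of the general idea. It assumes scikit-learn, a simple additive combination of the two scores, and a reciprocal transform so that noisy or influential points get small weights; all function names (`pps_weights`, `weighted_forest_predict`) and parameter choices (tree depth, number of trees) are illustrative, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def deviance_residuals(X, y):
    """Magnitude of each point's deviance residual under a logistic fit."""
    p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    eps = 1e-12  # guard against log(0)
    dev = -2.0 * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return np.sqrt(dev)

def loo_influence(X, y, max_depth=3):
    """Leave-one-out influence: change in a single shallow tree's
    training accuracy when each point is omitted."""
    base_acc = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=0).fit(X, y).score(X, y)
    infl = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=0).fit(X[mask], y[mask])
        infl[i] = abs(base_acc - tree.score(X[mask], y[mask]))
    return infl

def pps_weights(X, y):
    """Combine the two scores (here simply added -- an assumption) and
    down-weight points with large residual or influence."""
    score = deviance_residuals(X, y) + loo_influence(X, y)
    w = 1.0 / (1.0 + score)
    return w / w.sum()  # normalize to a probability vector

def weighted_forest_predict(X, y, X_test, n_trees=25, seed=0):
    """Forest built on weighted (PPS) bootstrap samples, majority vote."""
    rng = np.random.default_rng(seed)
    w = pps_weights(X, y)
    votes = np.zeros(len(X_test))
    for t in range(n_trees):
        idx = rng.choice(len(y), size=len(y), replace=True, p=w)
        votes += DecisionTreeClassifier(random_state=t).fit(X[idx], y[idx]).predict(X_test)
    return (votes >= n_trees / 2).astype(int)

# Noisy data with an (approximately) linear decision boundary
X, y = make_classification(n_samples=150, n_features=5, flip_y=0.15,
                           random_state=42)
pred = weighted_forest_predict(X[:120], y[:120], X[120:])
acc = (pred == y[120:]).mean()
```

Points that the logistic model fits poorly (large residual) or that noticeably shift the single tree's fit (large influence) receive small selection probabilities, so the bootstrap samples feeding each tree are biased toward clean, informative observations.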
