A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province

B Santoso, H Wijayanto, K A Notodiputro, B Sartono

doi:10.1088/1755-1315/187/1/012048

Abstract

Class imbalance has attracted considerable attention from researchers in recent years. It occurs when the data have unbalanced proportions between two or more groups, usually called the minority and majority classes. The imbalance biases parameter estimates and causes objects to be misclassified, especially those in the minority class. This leads to incorrect predictions for the minority class and ultimately puts policy making at risk. Several approaches have been proposed to correct the misclassification, including data-based and algorithm-based approaches. Among the data-based approaches, over-sampling is currently very popular; it balances the class distribution by adding synthetic data. This paper discusses strategies for adding synthetic data to improve classification accuracy and reviews several over-sampling methods for class imbalance problems. Specifically, the classification of poor households is illustrated using National Socio-Economic Survey (Susenas) data stratified by urban and rural areas. K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM) and Generalized Linear Model (GLM) classifiers are employed to evaluate classification performance by comparing sensitivity and the area under the ROC curve (AUC). The simulation results show bias in the parameter estimates for both the intercept and the slope. The bias grows as the data become more unbalanced and as the sample becomes smaller. Meanwhile, classification accuracy decreases as the probability value decreases (that is, as the data become highly imbalanced), especially in small samples. The loss of accuracy appears mainly in the minority class (sensitivity) and in AUC.
Based on the simulation results, synthetic over-sampling clearly improves classification accuracy for the minority class by increasing both sensitivity and AUC. This occurs at small probability values (unbalanced data). In line with the simulation results, over-sampling also shows evidence of improving the prediction of poor households in Yogyakarta Province. On the other hand, it can also decrease accuracy and specificity. Further research is therefore required to obtain more accurate predictions across all performance measures.
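
To make the idea of balancing classes with synthetic data concrete, the following is a minimal sketch under stated assumptions, not the procedure used in the study. It assumes a SMOTE-style scheme in which each synthetic point is interpolated between a minority sample and one of its nearest minority neighbours; the abstract does not name the specific over-sampling methods compared. The function names, the toy data and the parameter k are hypothetical. The sketch also shows how sensitivity, one of the two performance measures named above, is computed.

```python
import math
import random


def smote_like_oversample(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours.

    Assumption: a SMOTE-style interpolation scheme, chosen for illustration.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Random position on the segment from base towards the neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def sensitivity(y_true, y_pred, positive=1):
    """Share of actual positives (minority class) predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 4 minority points against a majority class of 10.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
majority_size = 10
new_points = smote_like_oversample(minority, majority_size - len(minority))
balanced_minority = minority + new_points  # now the same size as the majority

# Two of the four true positives are recovered by these made-up predictions.
example_sensitivity = sensitivity([1, 1, 1, 1, 0, 0], [1, 0, 1, 0, 0, 1])
```

Because every synthetic point lies on a segment between two existing minority points, this kind of scheme fills in the region the minority class already occupies and does not extrapolate beyond it. That is one plausible reason why sensitivity can rise while specificity falls, as the abstract reports.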

Full Text