불균형 데이터 집합의 분류를 위한 하이브리드 SVM 모델

Jae Sik Lee,Jong Gu Kwon

doi:10.13088/jiis.2013.19.2.125

Abstract

We call a data set in which the number of records belonging to a certain class far outnumbers the number of records belonging to the other class, `imbalanced data set`. Most of the classification techniques perform poorly on imbalanced data sets. When we evaluate the performance of a certain classification technique, we need to measure not only `accuracy` but also `sensitivity` and `specificity`. In a customer churn prediction problem, `retention` records account for the majority class, and `churn` records account for the minority class. Sensitivity measures the proportion of actual retentions which are correctly identified as such. Specificity measures the proportion of churns which are correctly identified as such. The poor performance of the classification techniques on imbalanced data sets is due to the low value of specificity. Many previous researches on imbalanced data sets employed `oversampling` technique where members of the minority class are sampled more than those of the majority class in order to make a relatively balanced data set. When a classification model is constructed using this oversampled balanced data set, specificity can be improved but sensitivity will be decreased. In this research, we developed a hybrid model of support vector machine (SVM), artificial neural network (ANN) and decision tree, that improves specificity while maintaining sensitivity. We named this hybrid model `hybrid SVM model.` The process of construction and prediction of our hybrid SVM model is as follows. By oversampling from the original imbalanced data set, a balanced data set is prepared. SVM_I model and ANN_I model are constructed using the imbalanced data set, and SVM_B model is constructed using the balanced data set. SVM_I model is superior in sensitivity and SVM_B model is superior in specificity. For a record on which both SVM_I model and SVM_B model make the same prediction, that prediction becomes the final solution. If they make different prediction, the final solution is determined by the discrimination rules obtained by ANN and decision tree. For a record on which SVM_I model and SVM_B model make different predictions, a decision tree model is constructed using ANN_I output value as input and actual retention or churn as target. We obtained the following two discrimination rules: `IF ANN_I output value

Full Text