A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance

G Ganesh Sundarkumar, Vadlamani Ravi

doi:10.1016/j.engappai.2014.09.019

Abstract

In this paper, we propose a novel hybrid undersampling approach that rectifies the data imbalance problem by employing k Reverse Nearest Neighborhood (kRNN) and One Class Support Vector Machine (OCSVM) in tandem. We mined an Automobile Insurance Fraud detection dataset and a customer Credit Card Churn prediction dataset to demonstrate the effectiveness of the proposed model. Throughout the paper, we followed 10-fold cross-validation, testing with Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), Probabilistic Neural Network (PNN), Group Method of Data Handling (GMDH), and Multi-Layer Perceptron (MLP) classifiers. DT and SVM yielded high sensitivities of 90.74% and 91.89%, respectively, on the Insurance dataset, while DT, SVM, and GMDH produced high sensitivities of 91.2%, 87.7%, and 83.1%, respectively, on the Credit Card Churn prediction dataset. On the Insurance Fraud detection dataset, we found no statistically significant difference between DT (J48) and SVM. Because DT yields "if-then" rules, we prefer DT over SVM. On the churn prediction dataset, GMDH, SVM, and LR turned out not to be statistically different, and GMDH yielded a very high area under the ROC curve (AUC). Further, DT yielded just 4 "if-then" rules on the Insurance dataset and 10 rules on the churn prediction dataset, which is a significant outcome of the study.
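The abstract does not spell out how the k reverse nearest neighborhood is computed or how it feeds into OCSVM. As a rough illustration of the general concept only, the sketch below counts, for each point, how many other points include it among their k nearest neighbors. Treating a zero count as a sign of an outlier in the majority class is an assumption made here for illustration, not a step confirmed by the abstract. The function name `reverse_neighbor_counts`, the toy data, and the choice k = 2 are all hypothetical.

```python
from math import dist


def reverse_neighbor_counts(points, k):
    """For each point, count how many other points list it among their k nearest neighbors."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Indices of all other points, ordered by Euclidean distance from point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        # Point i "votes" for each of its k nearest neighbors.
        for j in others[:k]:
            counts[j] += 1
    return counts


# Toy majority-class sample (hypothetical): four clustered points and one far away.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 12.0)]
counts = reverse_neighbor_counts(majority, k=2)

# Assumed filtering rule: drop points that no other point counts as a near neighbor.
kept = [p for p, c in zip(majority, counts) if c > 0]
print(counts, kept)
```

In this toy sample the distant point receives no votes and is filtered out, while the four clustered points are kept. A one-class SVM could then be fitted on the retained majority-class points, although the exact way the two stages are chained in the paper is not stated in the abstract.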

Full Text