Prediction of Potential Bank Customers: Application on Data Mining

Muhammet Sinan Başarslan, İrem Düzdar Argun

doi:10.1007/978-3-030-36178-5_9

Abstract

Banking is an important industry in which the financial transactions of everyday life are carried out, and banks are now used for all kinds of financial transactions. As competition increases, banks aim to acquire new customers through customer satisfaction. Studies on acquiring new customers by analyzing customer data have therefore gained importance recently, and banks have established data analysis units for this purpose. Such units have also been established in other customer-focused industries, such as insurance and telecommunications. In this study, models are built with classification algorithms to predict potential bank customers on the bank dataset in the UCI Machine Learning Repository, which was collected through telemarketing, and their results are compared. The comparison is intended to support a more detailed and effective data analysis. The classification algorithms used are the C4.5 Decision Tree, Naive Bayes (NB), K-nearest neighbors (k-nn), Logistic Regression (LogReg), Random Forest (RanFor), and Adaptive Boosting (AdaBoostM1-Ada). To check that the performance of the classification models is consistent, the dataset is divided into training and test sets by two different methods: K-fold cross-validation and holdout. In K-fold cross-validation, the training and test sets are separated with 5-fold and 10-fold cross-validation. In the holdout method, the dataset is divided into training and test sets with training and test ratios of 60–40%, 75–25%, and 80–20%, respectively.
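The two splitting schemes described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names `holdout_split` and `k_fold_splits`, the seed, and the dataset size of 1000 are assumptions made for the example, and neither split is stratified by class.

```python
import random


def holdout_split(n_samples, train_fraction, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset size, chosen only for illustration.
n = 1000

# Holdout with the three training fractions named in the abstract.
for fraction in (0.60, 0.75, 0.80):
    train, test = holdout_split(n, fraction)
    print(f"holdout {fraction:.0%}: {len(train)} train, {len(test)} test")

# 5-fold and 10-fold cross-validation.
for k in (5, 10):
    sizes = [len(test) for _, test in k_fold_splits(n, k)]
    print(f"{k}-fold: test fold sizes {sizes}")
```

In holdout each model is trained and tested once per ratio, whereas in K-fold cross-validation it is trained K times and the K test-fold scores are typically averaged.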
These splits are evaluated in terms of Accuracy (ACC), Precision (PPV), Sensitivity (TPR), and F-measure (F). The performance results are similar under both splitting methods. According to the accuracy and F-measure criteria, the classification model built with the Random Forest algorithm gave higher results than the other models, whereas the Naive Bayes algorithm gave the highest results according to the precision criterion, and the AdaBoostM1 classification algorithm gave the best results according to the sensitivity criterion.
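For reference, the four criteria follow from the counts of a binary confusion matrix using their standard definitions. The sketch below is illustrative only: the function name `classification_metrics` and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute ACC, PPV, TPR and F-measure from binary confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives. A ratio with a zero denominator
    is reported as 0.0.
    """
    total = tp + fp + fn + tn
    acc = (tp + tn) / total if total else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    # F-measure: harmonic mean of precision and sensitivity.
    f = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0
    return {"ACC": acc, "PPV": ppv, "TPR": tpr, "F": f}


# Hypothetical counts, chosen only to illustrate the calculation.
scores = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because the four criteria weight false positives and false negatives differently, it is unsurprising that different algorithms lead on different criteria: precision is lowered only by false positives and sensitivity only by false negatives.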

Full Text