Abstract

This article aims to assess corporate credit risk by predicting whether a customer has defaulted. The dataset comes from one of the leading financial institutions in Türkiye. It consists of 401 variables covering the applicant's data, corporate data, shareholder data, and the applicant's credit history with the creditor institution. We reduce this large number of variables by first separating the input variables from the rest and then screening those inputs to exclude strongly correlated variables and variables consisting almost entirely of missing or zero values. Many variables in the dataset have a large number of missing entries, but for justifiable reasons. To address this, we created seven subsets that reflect which group of variables applies to which customer. The dataset is imbalanced: among approved loans, about 96% of instances are non-default and only around 4% are default. We use three sampling techniques to balance the training sets: under-sampling, oversampling, and the synthetic minority oversampling technique (SMOTE). We apply six classifiers: Random Forest, Naïve Bayes, Logistic Regression, Support Vector Machine, Decision Tree, and K-Nearest Neighbor. To evaluate performance, we use sensitivity and specificity, which measure how well the majority class and the minority class, respectively, were predicted. As a result, we simultaneously achieved sensitivity and specificity both greater than 50%; under-sampling was the best sampling technique for the minority class, while the synthetic minority oversampling technique and oversampling performed better for the majority class.
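As a rough illustration of two ideas in the abstract, the sketch below shows random under-sampling of a training set and the computation of sensitivity and specificity. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `undersample` and `sensitivity_specificity`, the label coding (0 for non-default, 1 for default), and the toy data are all hypothetical. Because the abstract pairs sensitivity with the majority class, the positive label is a parameter that defaults to the non-default label here.

```python
import random
from collections import Counter


def undersample(X, y, majority_label=0, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumption: binary labels, with `majority_label` marking the larger class.
    Intended for the training set only; the test set keeps its original balance.
    """
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    # Keep every minority row and an equally sized random sample of majority rows.
    kept = rng.sample(majority, k=min(len(majority), len(minority))) + minority
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


def sensitivity_specificity(y_true, y_pred, positive=0):
    """Return (sensitivity, specificity) with `positive` as the positive class.

    Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy training set mirroring the roughly 96% / 4% split described above.
X_train = [[i] for i in range(100)]
y_train = [0] * 96 + [1] * 4

X_bal, y_bal = undersample(X_train, y_train)
print(Counter(y_bal))  # class counts after balancing
```

Oversampling and SMOTE would instead grow the minority class, by duplicating or synthesizing default instances, rather than shrinking the majority class.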
