Large Unbalanced Credit Scoring Using Lasso-Logistic Regression Ensemble

Hong Wang, Lifeng Zhou, Qingsong Xu, Frank Emmert-Streib

doi:10.1371/journal.pone.0117844

Abstract

Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data.
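The pipeline described above (balance the data, diversify it by resampling, fit Lasso-logistic base models, combine them) can be illustrated with a minimal sketch. It is not the authors' implementation, which is written in R. Several simplifications are assumptions made here for brevity: random undersampling of the majority class stands in for the clustering step, the L1-penalized model is fitted by plain proximal gradient descent, and base-model probabilities are combined by an unweighted average. All function names and default parameters (`fit_lasso_logistic`, `balanced_bootstrap`, `fit_ensemble`, `predict_proba`, `lam`, `lr`, `epochs`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_lasso_logistic(X, y, lam=0.01, lr=0.1, epochs=200):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def balanced_bootstrap(X, y, rng):
    """Bootstrap the minority class and draw an equal-sized majority sample.

    Assumption: random undersampling replaces the clustering-based
    balancing described in the paper.
    """
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [i for i, yi in enumerate(y) if yi == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    k = len(minority)
    idx = [rng.choice(minority) for _ in range(k)]
    idx += [rng.choice(majority) for _ in range(k)]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_ensemble(X, y, n_models=10, seed=0, **kwargs):
    """Fit one Lasso-logistic base model per balanced bootstrap sample."""
    rng = random.Random(seed)
    return [fit_lasso_logistic(*balanced_bootstrap(X, y, rng), **kwargs)
            for _ in range(n_models)]


def predict_proba(models, x):
    """Average the predicted probabilities of all base models (assumed rule)."""
    probs = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
             for w, b in models]
    return sum(probs) / len(probs)
```

Because each base model sees a differently resampled, class-balanced subset, the averaged prediction is less dominated by the majority class than a single model fitted to the raw data would be.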

Highlights

Credit scoring analyzes the characteristics and performance of past loans and predicts the delinquency probability of loan applicants in the near future based on age, financial flows, repayment records, and other characteristics [1]
The proposed LLRE algorithm is implemented in the R programming language
To create a diversified variable set for the logistic regression base models, we need to generate additional variables from the original variables or apply variable transformations to them

Summary

Introduction

Credit scoring analyzes the characteristics and performance of past loans and predicts the delinquency probability of loan applicants in the near future based on age, financial flows, repayment records, and other characteristics [1]. The most popular credit scoring methods are linear discriminant analysis [4], logistic regression [5], neural networks [6, 7], decision trees [8], and support vector machines [9]. Comprehensive reviews of these methods can be found in [10, 11, 12, 13, 14]. An algorithm using negative correlation was proposed in [23]. Support vector machines (SVM) are frequently chosen as the base learners in credit scoring ensemble algorithms [18, 25]. Results show that the proposed algorithm outperforms decision trees, Lasso-logistic regression, and popular ensemble learning algorithms such as random forests in terms of both AUC and F-measure. We also apply the two proposed variable importance measures to identify the top-ranking variables for credit scoring.
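The two evaluation criteria named above, AUC and F-measure, can be computed directly from true labels and model outputs. The following is a minimal sketch in Python (rather than the R used for LLRE), assuming binary labels coded as 1 for delinquent and 0 otherwise; the function names `auc` and `f_measure` are hypothetical. AUC is computed with the standard pairwise (Mann-Whitney) formulation, and the F-measure is the harmonic mean of precision and recall.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure(labels, predictions):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Usage: threshold the scores at 0.5 (an arbitrary choice) for the F-measure.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
print(auc(labels, scores), f_measure(labels, predictions))
```

AUC depends only on how the scores rank the applicants, whereas the F-measure depends on the chosen decision threshold, which is why the two are usually reported together on unbalanced data.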

Background

Experiment Results

Results and discussions

Conclusions