Training a single machine learning model only on the features shared by all clients wastes the remaining features, which is likely to harm the model's performance. To address this problem, the study attempts a new method: training an individual stacked model for each loan client based on personalised features. The data contain information on about fifteen million loan applicants, their default status, and 468 features in total. Of these, 41 features that can be quantitatively analysed are selected according to the feature importance output by a Random Forest model. A default prediction for each client is made by a stacked model trained on all the selected features that client has. The stacked model consists of two layers, with a Light Gradient-Boosting Machine (LGBM) classifier as the base learner and a Logistic Regression model as the meta learner. Because defaulters account for only 3.14% of the data, a significant class imbalance, Area Under the Curve (AUC) and F1 score are used to evaluate the method instead of accuracy. Test results show that models trained on personalised features outperform those trained on shared features. Additionally, the stacked model outperforms an individual Logistic Regression model but performs almost the same as an individual LGBM classifier. In detail, the stacked models trained with personalised features achieve AUC = 0.772 and F1 = 0.188. Owing to the class imbalance, the method's F1 score is relatively low but is considered acceptable. In future work, stacked models combining different base models will be attempted.