Credit scoring models are central to financial institutions’ credit decisions. This study examined how variables can be extracted from loan statements and whether these textual variables improve the accuracy of a default prediction model. We combined forward selection with non-negative matrix factorization to extract variables from loan statements, and built a credit scoring model using both loan statement text and numerical data. In the comparative analysis, the credit scoring model built with the optimal cut-off logistic regression model on both types of data achieved the highest accuracy. Moreover, compared with a credit scoring model built with a deep learning method based on word vectors, the model in this study was more interpretable. The regression analysis showed that the variables extracted from the loan statements have a significant effect on default status.