Abstract

Imbalanced data are notoriously difficult to handle, because choosing an appropriate treatment requires a thorough understanding of the data. Yet they occur in many fields, especially in finance, for example in banking. Many problems involving financial data can be framed as binary classification problems, and in some of them a failure to identify cases correctly can have severe consequences. This research investigates how machine learning models can be used to deal with imbalanced data in binary classification problems. We use imbalanced insurance and credit card datasets. We first perform feature selection by removing irrelevant columns. We then rebalance the data with the plain SMOTE algorithm and two of its variants (K-Means SMOTE and Borderline SMOTE) as oversampling methods, and with Near-Miss and All-KNN as undersampling methods. The algorithms are implemented using scikit libraries. Lastly, PCA is used for dimensionality reduction, and Logistic Regression is used as the machine learning model, with cross-validation to select the best hyperparameter. The procedure produces five Logistic Regression models, one per resampling method, which are then compared. The results show that the oversampling methods work better than the undersampling methods, with K-Means SMOTE and Borderline SMOTE performing better than plain SMOTE. This suggests that machine learning can be used to address binary classification problems on imbalanced financial data.
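To make the oversampling step concrete, the following is a minimal sketch of the interpolation idea behind plain SMOTE, written in standard-library Python. It is not the implementation used in this research, which relies on library code; the function name `smote`, its parameters, and the toy data are hypothetical and chosen only for illustration. The K-Means and Borderline variants differ in how they choose which minority samples to interpolate from and are not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (made up for illustration): 4 minority vs 12 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(x), float(y)) for x in range(3, 7) for y in range(3, 6)]

# Generate enough synthetic minority points to match the majority count.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class. In practice one would more likely use a maintained implementation, such as the `SMOTE`, `BorderlineSMOTE`, `KMeansSMOTE`, `NearMiss`, and `AllKNN` classes in the imbalanced-learn package, rather than hand-written code like this.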

