Assessing the Performance of Machine Learning Models for Default Prediction under Missing Data and Class Imbalance: A Simulation Study

doi:10.5784/40-1-767

Abstract

In the field of machine learning, robust model performance is essential for accurate predictions and informed decision-making. One critical challenge that hampers the effectiveness of machine learning algorithms is the presence of missing data. Missing values are ubiquitous in real-world datasets and can significantly degrade the performance of predictive models. This study explores how increasing levels of missing values affect the performance of machine learning models. Simulated samples with missing values ranging from 5% to 50% were generated, and several models were evaluated on each. The results show a consistent trend: model performance deteriorates as the proportion of missing values increases, with higher levels of missing values leading to lower accuracy scores across all models. Among the models evaluated, decision trees (DT) and random forests (RF) consistently achieved high accuracy scores across all sampling techniques, indicating robustness to missing values. Logistic regression (LR) also performed relatively well, showing consistent performance across different levels of missing values. In contrast, the stochastic gradient descent classifier (SGDC), k-nearest neighbors (kNN), and naive Bayes (NB) models consistently exhibited lower accuracy scores across all sampling techniques, indicating limitations in handling missing values even when the dataset was more balanced. Furthermore, the study finds that the SMOTE (Synthetic Minority Over-sampling Technique) sampling technique outperforms the under-sampling approach: models trained using SMOTE consistently achieved higher accuracy scores across all levels of missing values. This suggests that SMOTE sampling effectively handles imbalanced datasets and enhances classification performance, particularly when dealing with missing values.
As data increasingly drives decision-making, these findings on the escalating impact of missing values on machine learning models underscore the need for robust data handling techniques. Addressing the pervasive challenge of missing data is a prerequisite for realizing the potential of machine learning in real-world applications.
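
The simulation design summarized in the abstract (blank out a growing share of values, then rebalance the classes by over- or under-sampling) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the abstract does not state the missingness mechanism or how missing values were treated, so the sketch removes cells completely at random and fills them with column means. The function names are hypothetical, the data is a toy two-class sample, and `smote_like_oversample` is a simplified interpolation in the spirit of SMOTE, not a library implementation.

```python
import random


def inject_missing(rows, rate, rng):
    """Blank out round(rate * cells) entries, chosen uniformly at random (assumed mechanism)."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = set(rng.sample(cells, round(rate * len(cells))))
    return [[None if (i, j) in masked else v for j, v in enumerate(row)]
            for i, row in enumerate(rows)]


def mean_impute(rows):
    """Fill each None with the mean of the observed values in its column (assumed treatment)."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def split_by_class(X, y):
    """Return (minority_label, minority_rows, majority_rows) for a binary problem."""
    minority_label = min(set(y), key=y.count)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [x for x, label in zip(X, y) if label != minority_label]
    return minority_label, minority, majority


def random_undersample(X, y, rng):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    minority_label, minority, majority = split_by_class(X, y)
    majority_label = next(label for label in y if label != minority_label)
    kept = rng.sample(majority, len(minority))
    labels = [minority_label] * len(minority) + [majority_label] * len(kept)
    return minority + kept, labels


def smote_like_oversample(X, y, rng, k=3):
    """Add synthetic minority rows by interpolating towards one of the k nearest minority rows."""
    minority_label, minority, majority = split_by_class(X, y)
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        base = rng.choice(minority)
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return X + synthetic, y + [minority_label] * len(synthetic)


# Toy imbalanced sample: 90 majority rows and 10 minority rows, two features each.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

for rate in (0.05, 0.25, 0.50):
    X_filled = mean_impute(inject_missing(X, rate, rng))
    X_smote, y_smote = smote_like_oversample(X_filled, y, rng)
    X_under, y_under = random_undersample(X_filled, y, rng)
    print(rate, len(y_smote), len(y_under))
```

The size difference between the two resampled sets is one plausible reason over-sampling can help: under-sampling discards most of the majority class, while the SMOTE-style approach keeps every original row and adds synthetic minority rows. In a full experiment, each resampled set would then be passed to the classifiers under comparison and scored on held-out data.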

Journal: ORION
Publication Date: Jan 1, 2024
License type: cc-by

Similar Papers

KNNOR: An oversampling technique for imbalanced datasets
Ashhadul Islam ... Halima Bensmail
Applied Soft Computing | VOL. 115
10 Dec 2021

Understanding the Performance of Machine Learning Models to Predict Credit Default: A Novel Approach for Supervisory Evaluation
Andrés Alonso ... Jose Manuel Carbo
SSRN Electronic Journal | VOL. -
27 Jan 2021

Influence of Data Splitting on Performance of Machine Learning Models in Prediction of Shear Strength of Soil
Quang Hung Nguyen ... Van Quan Tran
Mathematical Problems in Engineering | VOL. 2021
05 Feb 2021

Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies.
Yang Ren ... Ana López-Defede
Journal of Medical Internet Research | VOL. 25
31 May 2023
