
To the Editor:

Machine learning techniques may improve risk prediction and disease screening. Class imbalance (a ratio of noncases to cases greater than 1) routinely occurs in epidemiologic data and may degrade the predictive performance of machine learning algorithms.1–4 Of the many techniques developed to address class imbalance,5,6 we investigated simple undersampling here. This method is straightforward and accessible, but evidence on its performance is mixed and practical guidance is needed. Using simulated data, we assessed the predictive performance of the ensemble machine learning algorithm SuperLearner and of logistic regression in imbalanced and undersampled data to determine whether undersampling alters predictive accuracy.

DATA-GENERATING MECHANISM

We used Monte Carlo simulation with four groups of 1,000 Monte Carlo samples each. Each Monte Carlo sample had a sample size of 1,000, with 10 independent continuous covariates drawn from a standard normal distribution and 10 independent dichotomous covariates drawn from a binomial distribution. A dichotomous outcome was simulated from a logistic regression model conditional on all 20 covariates. Regression parameters for the continuous covariates were chosen to lie between −1 and 1, and the outcome prevalence was set to lie between 0.15 and 0.50.

STUDY DESIGN

In two of the four groups of Monte Carlo samples, we left all samples unbalanced. In the remaining two groups, we balanced each sample by undersampling: randomly selecting a number of noncases equal to the number of cases.
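The data-generating mechanism and the undersampling step can be sketched in R roughly as follows. This is an illustrative sketch, not the authors' code (which is in the eSupplement): the coefficient values, intercept, and seed are assumptions, but the structure (20 covariates, a logistic outcome model, and case-matched undersampling) follows the text.

```r
# Sketch of one Monte Carlo sample (n = 1,000): 10 continuous covariates
# from a standard normal distribution, 10 dichotomous covariates from a
# binomial distribution, and a dichotomous outcome from a logistic model.
# Coefficients, intercept, and seed are illustrative, not the authors' values.
set.seed(123)
n <- 1000
x_cont <- matrix(rnorm(n * 10), nrow = n)                        # continuous
x_bin  <- matrix(rbinom(n * 10, size = 1, prob = 0.5), nrow = n) # dichotomous
x <- cbind(x_cont, x_bin)
beta <- runif(20, min = -1, max = 1)             # parameters between -1 and 1
p <- plogis(as.vector(-1.5 + x %*% beta))        # intercept chosen so prevalence
y <- rbinom(n, size = 1, prob = p)               # falls roughly in 0.15-0.50

# Undersampling: keep every case, randomly draw an equal number of noncases
undersample <- function(y) {
  cases <- which(y == 1)
  noncases <- sample(which(y == 0), size = length(cases))
  sort(c(cases, noncases))
}
idx <- undersample(y)
```

After undersampling, `mean(y[idx])` is exactly 0.5 (the balanced prevalence) while every case is retained.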
To avoid overfitting, we split each Monte Carlo sample into training (70%) and testing (30%) sets with similar outcome prevalences.7 We generated predicted probabilities in the 1,000 undersampled and 1,000 unbalanced samples parametrically via logistic regression and nonparametrically via stacking (SuperLearner).4 We implemented SuperLearner with 10-fold cross-validation and a library of five algorithms with default tuning parameters: extreme gradient boosting, random forests, kernel k-nearest neighbors, kernel support vector machines, and penalized regression (LASSO). We implemented logistic regression as a generalized linear model with a binomial variance and a logit link function. We evaluated average performance metrics (sensitivity, specificity, positive and negative predictive value, and overall accuracy) across all 1,000 Monte Carlo samples in each group, using a classification threshold of 0.2 (close to the outcome prevalence) for the unbalanced groups and 0.5 for the undersampled groups. We computed the area under the receiver operating characteristic (ROC) curve for each sample using the roc() function in the "pROC" package.8 We conducted all analyses in R version 3.6.1.

The Figure shows the ROC curves for all 1,000 Monte Carlo samples in each group, along with average predictive performance metrics. Areas under the curve were similar across all groups. Performance metrics were higher for logistic regression than for SuperLearner regardless of data preprocessing method, except sensitivity and positive predictive value, which were higher for SuperLearner. Undersampling did not substantially affect logistic regression performance; however, it improved SuperLearner accuracy, specificity, and positive predictive value and worsened SuperLearner sensitivity and negative predictive value.
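The analysis design might be sketched in R as follows. This is a minimal illustration rather than the authors' code (see the eSupplement): the toy data, the stratified-split helper, and the SuperLearner wrapper names (SL.xgboost, SL.randomForest, SL.kernelKnn, SL.ksvm, SL.glmnet) are assumptions, and the SuperLearner call requires the corresponding learner packages to be installed.

```r
library(SuperLearner)  # learners assume xgboost, randomForest, KernelKnn, kernlab, glmnet
library(pROC)

# Toy imbalanced sample standing in for one Monte Carlo sample
set.seed(123)
n <- 1000
x <- cbind(matrix(rnorm(n * 10), nrow = n),
           matrix(rbinom(n * 10, 1, 0.5), nrow = n))
y <- rbinom(n, 1, plogis(as.vector(-1.5 + x %*% runif(20, -1, 1))))
dat <- data.frame(x)

# 70/30 split, stratified on the outcome so prevalences stay similar
train <- unlist(lapply(split(seq_len(n), y),
                       function(i) sample(i, floor(0.7 * length(i)))))
test <- setdiff(seq_len(n), train)

# Parametric: GLM with binomial variance and logit link
fit_glm <- glm(y ~ ., data = cbind(dat, y = y)[train, ], family = binomial())
p_glm <- predict(fit_glm, newdata = dat[test, ], type = "response")

# Nonparametric: SuperLearner, 10-fold CV, five-algorithm library
sl <- SuperLearner(Y = y[train], X = dat[train, ], newX = dat[test, ],
                   family = binomial(), cvControl = list(V = 10),
                   SL.library = c("SL.xgboost", "SL.randomForest",
                                  "SL.kernelKnn", "SL.ksvm", "SL.glmnet"))
p_sl <- as.numeric(sl$SL.predict)

# Threshold-based metrics (0.2 for unbalanced data, 0.5 after undersampling)
metrics <- function(y, p, cut) {
  yhat <- as.integer(p >= cut)
  c(sensitivity = mean(yhat[y == 1] == 1),
    specificity = mean(yhat[y == 0] == 0),
    ppv = mean(y[yhat == 1] == 1),
    npv = mean(y[yhat == 0] == 0),
    accuracy = mean(yhat == y))
}
metrics(y[test], p_glm, cut = 0.2)
auc(roc(y[test], p_glm))  # area under the ROC curve via pROC
```

The stratified split keeps training and testing prevalences close, so the same classification threshold is sensible in both sets.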
Repeating the analysis with a lower outcome prevalence (2%–10%) did not substantially affect the results.

FIGURE. Receiver operating characteristic curves of each Monte Carlo sample by data preprocessing method and prediction technique. The figure displays individual ROC curves (gray lines) for each of the 1,000 Monte Carlo samples within each of the four groups, as well as the average ROC curve (black line) across all 1,000 Monte Carlo samples within each group. Average performance metrics across all 1,000 Monte Carlo samples are also displayed in each panel. A, logistic regression with undersampling; B, SuperLearner with undersampling; C, logistic regression without undersampling; D, SuperLearner without undersampling.

We observed generally more accurate predictive performance with logistic regression than with SuperLearner regardless of data preprocessing method. This is expected because we simulated our data from a logistic model. However, SuperLearner performed nearly as well on average as the true data-generating mechanism, although logistic regression was intentionally excluded from the SuperLearner library. In our simulations, undersampling did not dramatically improve predictive performance, suggesting that ensemble machine learning can achieve adequate performance in similar settings with moderate class imbalance. These results provide some insight on the optimal use of machine learning for predicting imbalanced outcomes. Example code to reproduce these analyses is available in the eSupplement; https://links.lww.com/EDE/B675.

Abigail R. Cartus, MPH
Department of Epidemiology
Graduate School of Public Health
University of Pittsburgh
Pittsburgh, PA

Lisa M. Bodnar, RD, PhD
Department of Epidemiology
Graduate School of Public Health
University of Pittsburgh
Pittsburgh, PA
Department of Obstetrics, Gynecology, and Reproductive Sciences
University of Pittsburgh School of Medicine
Pittsburgh, PA
Magee-Womens Research Institute
University of Pittsburgh
Pittsburgh, PA

Ashley I. Naimi, PhD
Department of Epidemiology
Graduate School of Public Health
University of Pittsburgh
Pittsburgh, PA
[email protected]
