Abstract

We conduct experiments showing that the Area Under the Precision-Recall Curve (AUPRC) metric provides more meaningful insight into the impact of Random Undersampling than the Area Under the Receiver Operating Characteristic Curve (AUC). Evaluating experiments with multiple metrics is a robust way to address challenges in Machine Learning, such as class imbalance, and Random Undersampling is one technique for dealing with class imbalance. We find that Random Undersampling may improve AUC scores while, at the same time, being detrimental to AUPRC scores. Unlike AUC, AUPRC incorporates precision, and in the classification of imbalanced Big Data an increase in false positive counts causes a noticeable drop in precision. Therefore, in application domains where false positives are undesirable, optimizing models for AUPRC is a wise choice. Our contribution is to compare model performance in terms of AUPRC and AUC, in order to show the impact of Random Undersampling on the classification of highly imbalanced Big Data. Models are built with data in its original class ratio, and with data undersampled into five distinct class ratios. We report the results of 600 experiments in which we apply Random Undersampling to a dataset of about 175 million instances. To the best of our knowledge, we are the first to utilize the Medicare Part D data that became available in 2021.
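To make the comparison concrete, the following is a minimal sketch (ours, not the authors' experimental code) of the evaluation the abstract describes: train a classifier on data at its original class ratio and on randomly undersampled data, then score each model with both AUC and AUPRC. It assumes scikit-learn; the toy dataset, logistic regression model, and 1:1 undersampling ratio are illustrative choices only, and average_precision_score is used as a common approximation of AUPRC.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, average_precision_score
    from sklearn.model_selection import train_test_split

    # Toy highly imbalanced binary data (about 1% positive class).
    X, y = make_classification(n_samples=50_000, weights=[0.99], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    def undersample(X, y, ratio=1.0, seed=0):
        """Randomly drop majority-class instances until the
        majority:minority ratio equals `ratio`."""
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(y == 1)
        neg = np.flatnonzero(y == 0)
        keep_neg = rng.choice(neg, size=int(len(pos) * ratio), replace=False)
        idx = np.concatenate([pos, keep_neg])
        return X[idx], y[idx]

    # Fit on the original ratio and on a 1:1 undersampled ratio,
    # then evaluate both models on the same untouched test set.
    for name, (Xf, yf) in {
        "original ratio": (X_tr, y_tr),
        "undersampled 1:1": undersample(X_tr, y_tr),
    }.items():
        model = LogisticRegression(max_iter=1000).fit(Xf, yf)
        scores = model.predict_proba(X_te)[:, 1]
        print(f"{name}: AUC={roc_auc_score(y_te, scores):.3f}, "
              f"AUPRC={average_precision_score(y_te, scores):.3f}")

Because precision appears only in AUPRC, a model whose undersampling inflates false positives can hold its AUC steady while its AUPRC falls, which is the gap the experiments in the paper measure.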
