Abstract

Using the wrong metrics to gauge classification performance on highly imbalanced Big Data may hide important information in experimental results. However, we find that related works rarely analyze performance metrics and what they can hide or reveal. Therefore, we address that gap by analyzing multiple popular performance metrics on three Big Data classification tasks. To the best of our knowledge, we are the first to utilize three new Medicare insurance claims datasets which became publicly available in 2021. These datasets are all highly imbalanced. Furthermore, the datasets contain completely different data. We evaluate the performance of five ensemble learners in the Machine Learning task of Medicare fraud detection. Random Undersampling (RUS) is applied to induce five class ratios. The classifiers are evaluated with both the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC) metrics. We show that AUPRC provides better insight into classification performance. Our findings reveal that the AUC metric hides the performance impact of RUS, whereas classification results in terms of AUPRC show that RUS has a detrimental effect. We show that, for highly imbalanced Big Data, the AUC metric fails to capture information about precision scores and false positive counts that the AUPRC metric reveals. Our contribution is to show that AUPRC is a more effective metric for evaluating the performance of classifiers when working with highly imbalanced Big Data.
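A minimal sketch of the comparison described above, not the paper's pipeline: the synthetic dataset, the random forest learner, the 1:1 undersampling ratio, and the helper function `rus` are illustrative assumptions used to show how AUC and AUPRC can be computed side by side on a highly imbalanced task, with and without Random Undersampling.

```python
# Illustrative sketch (assumed setup, not the authors' experiment):
# contrast AUC and AUPRC on a highly imbalanced dataset, with and without RUS.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a highly imbalanced task (~0.5% positives).
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.995, 0.005], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def rus(X, y, ratio=1.0, seed=0):
    """Randomly undersample the majority class to `ratio` majority:minority."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    keep_neg = rng.choice(neg, size=int(len(pos) * ratio), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

for name, (Xf, yf) in {"no RUS": (X_tr, y_tr),
                       "RUS 1:1": rus(X_tr, y_tr, ratio=1.0)}.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xf, yf)
    scores = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_te, scores):.3f}  "
          f"AUPRC={average_precision_score(y_te, scores):.3f}")
```

Because AUPRC is built from precision, it is sensitive to the additional false positives that undersampling can introduce, while AUC, which depends only on true positive and false positive rates, can remain nearly unchanged.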
