Deep Learning and Thresholding with Class-Imbalanced Big Data

Justin M Johnson, Taghi M Khoshgoftaar

doi:10.1109/icmla.2019.00134

Abstract

Class imbalance is a regularly occurring problem in machine learning that has been studied extensively over the last two decades. Various methods for addressing class imbalance have been introduced, including algorithm-level methods, data-level methods, and hybrid methods. While these methods are well studied using traditional machine learning algorithms, there are relatively few studies that explore their application to deep neural networks. Thresholding, in particular, is rarely discussed in the literature on deep learning with class imbalance. This paper addresses this gap by conducting a systematic study on the application of thresholding with deep neural networks using a Big Data Medicare fraud data set. We use random oversampling (ROS), random undersampling (RUS), and a hybrid ROS-RUS to create 15 training distributions with varying levels of class imbalance. With the fraudulent class size ranging from 0.03% to 60%, we identify optimal classification thresholds for each distribution on random validation sets and then score the thresholds on a 20% holdout test set. Through repetition and statistical analysis, confidence intervals show that the default threshold is never optimal when training data is imbalanced. Results also show that the optimal threshold outperforms the default threshold in nearly all cases, and linear models indicate a strong linear relationship between the minority class size and the optimal decision threshold. To the best of our knowledge, this is the first study to provide statistical results that describe optimal classification thresholds for deep neural networks over a range of class distributions.
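The threshold selection step described above can be illustrated with a minimal sketch. It assumes a model that outputs positive-class probabilities and picks the candidate threshold that maximizes a score on a validation set. The abstract does not state the selection criterion, so the geometric mean of the true positive rate and true negative rate is used here purely as an assumption; the function names `g_mean` and `select_threshold`, the candidate grid, and the synthetic Beta-distributed scores are all hypothetical and are not taken from the paper.

```python
import numpy as np


def g_mean(y_true, y_prob, threshold):
    """Geometric mean of TPR and TNR at a given decision threshold.

    Assumption: this metric stands in for whatever criterion defines
    the "optimal" threshold; the abstract does not specify one.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_prob) >= threshold
    pos = y_true == 1
    neg = ~pos
    tpr = y_pred[pos].mean() if pos.any() else 0.0
    tnr = (~y_pred[neg]).mean() if neg.any() else 0.0
    return float(np.sqrt(tpr * tnr))


def select_threshold(y_true, y_prob, candidates=None):
    """Return the candidate threshold with the highest validation G-mean."""
    if candidates is None:
        # Hypothetical grid; a finer grid near 0 may suit severe imbalance.
        candidates = np.linspace(0.0, 1.0, 1001)
    scores = [g_mean(y_true, y_prob, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


# Synthetic validation set with roughly 1% positives (illustrative only).
rng = np.random.default_rng(0)
n = 20_000
y_val = (rng.random(n) < 0.01).astype(int)
# Scores are skewed toward 0, as is typical when the training data
# is dominated by the negative class.
p_val = np.where(y_val == 1, rng.beta(2, 20, n), rng.beta(1, 50, n))

best_t = select_threshold(y_val, p_val)
print("selected threshold:", best_t)
print("G-mean at selected threshold:", g_mean(y_val, p_val, best_t))
print("G-mean at default 0.5:", g_mean(y_val, p_val, 0.5))
```

In a setup like the one the abstract describes, the selected threshold would then be frozen and scored once on the holdout test set, so the test data plays no part in choosing it.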

Full Text