In this paper, we propose systematic approaches for learning imbalanced data based on a two-regime process: regime 0, which generates excess zeros (majority class), and regime 1, which contributes to generating an outcome of one (minority class). The proposed model contains two latent equations: a split probit (logit) equation in the first stage and an ordinary probit (logit) equation in the second stage. Because boosting improves the accuracy of prediction versus using a single classifier, we combined a boosting strategy with the two-regime process. Thus, we developed the zero-inflated probit boost (ZIPBoost) and zero-inflated logit boost (ZILBoost) methods. We show that the weight functions of ZIPBoost have the desired properties for good predictive performance. Like AdaBoost, the weight functions upweight misclassified examples and downweight correctly classified examples. We show that the weight functions of ZILBoost have similar properties to those of LogitBoost. The algorithm will focus more on examples that are hard to classify in the next iteration, resulting in improved prediction accuracy. We provide the relative performance of ZIPBoost and ZILBoost, which rely on the excess kurtosis of the data distribution. Furthermore, we show the convergence and time complexity of our proposed methods. We demonstrate the performance of our proposed methods using a Monte Carlo simulation, mergers and acquisitions (M&A) data application, and imbalanced datasets from the Keel repository. The results of the experiments show that our proposed methods yield better prediction accuracy compared to other learning algorithms.
Read full abstract