Abstract
Class imbalance learning (CIL) is an important branch of machine learning: classification models generally struggle to learn from imbalanced data, yet skewed class distributions arise frequently in real-world applications. In this paper, we introduce a novel CIL solution called the Probability Density Machine (PDM). First, in the context of the Gaussian Naive Bayes (GNB) predictive model, we theoretically analyze why an imbalanced data distribution degrades the performance of the predictive model, and conclude that the harm of class imbalance is associated only with the prior probability, not with the conditional probability of the training data. Then, in this context, we show the rationale behind several traditional CIL techniques, and further indicate the drawback of combining GNB with them. Next, drawing on the idea of K-nearest neighbors probability density estimation (KNN-PDE), we propose the PDM, an improved GNB-based CIL algorithm. Finally, experiments on a large number of class-imbalanced data sets show that the proposed PDM algorithm yields promising results.
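To make the idea concrete, the following is a minimal sketch of a PDM-style classifier, not the authors' implementation: the class-conditional density p(x|c) is approximated with the classic k-NN density estimator (k over n times the volume of the ball reaching the k-th nearest neighbor), and the prediction deliberately ignores the skewed prior P(c). The function names (knn_density, pdm_predict) and the choice of k are illustrative assumptions.

```python
import numpy as np
from math import gamma, pi

def knn_density(x, class_samples, k=5):
    """k-NN density estimate: p(x | c) ~ k / (n * V_k(x)), where V_k(x)
    is the volume of the d-ball whose radius is the distance from x to
    its k-th nearest neighbor among the n samples of class c."""
    n, d = class_samples.shape
    k = min(k, n)
    dists = np.sort(np.linalg.norm(class_samples - x, axis=1))
    r = dists[k - 1] + 1e-12                            # guard against r == 0
    v = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d     # volume of a d-ball
    return k / (n * v)

def pdm_predict(x, X_train, y_train, k=5):
    """Label x by the largest class-conditional density,
    deliberately neglecting the (skewed) prior P(c)."""
    classes = np.unique(y_train)
    dens = [knn_density(x, X_train[y_train == c], k) for c in classes]
    return classes[int(np.argmax(dens))]

# toy imbalanced data: 95 majority vs. 5 minority samples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (95, 2)), rng.normal(3.0, 1.0, (5, 2))])
y = np.array([0] * 95 + [1] * 5)
print(pdm_predict(np.array([3.0, 3.0]), X, y, k=3))     # likely prints 1
```

Because the density estimate is computed per class, the minority class is judged on its own sample geometry rather than being drowned out by the majority prior, which is the intuition the abstract ascribes to PDM.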
Highlights
Motivated by (4), we observe a potential new class imbalance learning (CIL) solution, i.e., neglecting the prior probability and directly estimating the conditional probability of each class to make the decision (see the decision rules sketched after these highlights). This solution avoids the tedious procedure of balancing prior probabilities and solves the CIL problem at its root. Although the problem seems to become easier, it is still difficult to provide an accurate estimate of the conditional probability
From a theoretical perspective, we analyze why a class-imbalanced distribution hurts the performance of the predictive model in the context of the Gaussian Naive Bayes classifier
It is deduced that the harm of an imbalanced data distribution is associated only with the prior probability, not with the conditional probability density
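As an illustration of these highlights (the paper's own equation (4) is not reproduced here, so the notation below is assumed), the standard maximum-a-posteriori rule of (Gaussian) Naive Bayes and the prior-free variant the highlights describe can be written as:

```latex
% MAP decision rule of (Gaussian) Naive Bayes: under imbalance the
% skewed prior P(c) biases the decision toward the majority class.
\hat{y}(x) = \arg\max_{c} \, P(c)\, p(x \mid c)

% Prior-free variant alluded to in the highlights: decide by the
% class-conditional density alone, so the skewed prior cannot bias it.
\hat{y}(x) = \arg\max_{c} \, p(x \mid c)
```

For example, under a heavy imbalance with P(c_maj) = 0.95 and P(c_min) = 0.05, the first rule predicts the minority label only when p(x | c_min) exceeds p(x | c_maj) by a factor of 19; this prior-induced bias is exactly what the second rule removes.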
Summary
Learning from imbalanced data is an important and active topic in machine learning, as it has been widely applied to diagnose and classify diseases [1, 2], detect software defects [3, 4], analyze biological and pharmacological data [5, 6], evaluate credit risk [7], predict actionable revenue change and bankruptcy [8, 9], diagnose faults in industrial processes [10, 11], classify soil types [12, 13], and even predict crash injury severity [14] or analyze crime linkages [15]. In the past two decades, hundreds of CIL algorithms have been proposed to address the imbalanced classification problem [18, 19]. These CIL methods can be roughly divided into three categories: data-level [20,21,22,23,24,25,26,27], algorithmic-level [28,29,30,31,32,33,34,35], and ensemble learning [36,37,38,39,40,41,42]. Ensemble learning combines either data-level or algorithmic-level approaches with the Bagging or Boosting paradigm to improve the accuracy and robustness of CIL.