Abstract
Unequal data distribution among classes usually causes a class imbalance problem. Because of class imbalance, classification models become biased toward the majority class and misclassify the minority class. The problem becomes more complex when it occurs in multi-class data. The most common way to handle class imbalance is data resampling, which involves either over-sampling minority class instances or under-sampling majority class instances. Under-sampling risks losing crucial information, whereas over-sampling can cause overfitting. Therefore, we propose a novel Cluster-based Hybrid Sampling for Imbalance Data (CBHSID) strategy to address these issues. We calculate the mean number of data observations per class, and CBHSID uses this mean as a threshold value to segregate majority and minority classes. We apply affinity propagation cluster analysis to each class to create sub-clusters. Within each sub-cluster, we calculate the distance of each data item from the centroid (mean) of the sub-cluster. During under-sampling, we remove data observations that are far from the center of the sub-cluster. During over-sampling, we generate synthetic samples using data observations near the center of the sub-cluster. We compared our proposed approach with a few state-of-the-art data balancing methods on 12 binary and 4 multi-class benchmark datasets. Based on Geometric-Mean (G-Mean), Recall, and F1-score, our method outperformed the compared methods on 14 of the 16 datasets. These results indicate that CBHSID is suitable for addressing class imbalance in both binary and multi-class classification. So far, we have only validated CBHSID on stationary data streams; it can be further tested on non-stationary data streams in online learning environments.
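The steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the sub-clusters are assumed to have been computed already (for example with an affinity propagation implementation such as scikit-learn's `AffinityPropagation`), and the way synthetic samples are generated (interpolating a near-center point toward the centroid) is an assumption, since the abstract does not specify it.

```python
from collections import Counter
from math import dist
from statistics import mean


def split_classes(labels):
    """Separate majority and minority classes using the mean class size."""
    counts = Counter(labels)
    # Threshold: total observations divided by the number of classes.
    threshold = len(labels) / len(counts)
    majority = {c for c, n in counts.items() if n > threshold}
    # Assumption: classes exactly at the threshold are left untouched.
    minority = {c for c, n in counts.items() if n < threshold}
    return threshold, majority, minority


def centroid(points):
    """Coordinate-wise mean of a sub-cluster."""
    return tuple(mean(axis) for axis in zip(*points))


def rank_by_centroid_distance(points):
    """Order points from nearest to farthest from the sub-cluster centroid."""
    c = centroid(points)
    return sorted(points, key=lambda p: dist(p, c))


def undersample(cluster, keep):
    """Keep the `keep` points nearest the centroid; drop the farthest ones."""
    return rank_by_centroid_distance(cluster)[:keep]


def oversample(cluster, n_new):
    """Create `n_new` synthetic points from observations near the centroid.

    Assumption: each synthetic point is the midpoint between a near-center
    observation and the centroid. The abstract does not state the rule.
    """
    c = centroid(cluster)
    ranked = rank_by_centroid_distance(cluster)
    synthetic = []
    for i in range(n_new):
        p = ranked[i % len(ranked)]  # cycle through points, nearest first
        synthetic.append(tuple((a + b) / 2 for a, b in zip(p, c)))
    return synthetic
```

For example, with six observations of class "a" and two of class "b", the threshold is 4, so "a" is treated as the majority class (candidates for under-sampling) and "b" as the minority class (candidates for over-sampling).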