Abstract

One of the major problems investigated across real-world application domains is the skewed (imbalanced) distribution of data. The class imbalance problem arises when the number of samples in one class significantly exceeds the number of samples in the other. Traditional learning algorithms trained on such data tend to be biased towards the majority class (i.e., the class with more samples), which significantly degrades classification performance. Recent studies have shown that other data characteristics, such as small disjuncts, class overlap, and noise, can make classification much more difficult. In this paper, we propose ClustSyn, an effective hybrid method for imbalanced data classification based on clustering and synthetic sample generation. It combines a clustering algorithm with synthetic data generation using the Mahalanobis distance. The Mahalanobis distance is used because it can minimize the probability of overlap and preserve the covariance structure when generating synthetic samples for the minority class. The efficiency of ClustSyn is compared with existing ensemble-learning methods such as AdaBoost, RUSBoost, and SMOTEBoost. We performed experiments with different imbalance ratios (IR) on 11 datasets. The results show that ClustSyn outperforms the existing methods on imbalanced and small-disjunct datasets.

Keywords: Ensemble learning, Class imbalance problem, Sampling, Imbalance ratio, Mahalanobis distance
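The abstract names neither the clustering algorithm nor the exact generation rule, so the following is only a minimal sketch of how cluster-wise synthetic generation with the Mahalanobis distance could look, not the ClustSyn procedure itself. It assumes NumPy, cluster labels supplied by any clustering algorithm, and the common definition of the imbalance ratio as majority count divided by minority count. The function names (`imbalance_ratio`, `mahalanobis`, `synthesize_in_cluster`, `oversample_minority`) and the `ridge` parameter are hypothetical. Each synthetic point is placed at the same Mahalanobis distance from its cluster mean as a randomly chosen seed sample, so new points stay on the covariance contours of the cluster they came from.

```python
import numpy as np


def imbalance_ratio(y):
    """Majority-class count divided by minority-class count (assumed IR definition)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()


def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of x from a distribution with the given mean and inverse covariance."""
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))


def synthesize_in_cluster(cluster, n_new, rng, ridge=1e-6):
    """Draw n_new synthetic points that follow the covariance shape of one cluster.

    Each point lies at the Mahalanobis distance of a random seed sample,
    but in a random direction around the cluster mean.
    """
    mean = cluster.mean(axis=0)
    dim = cluster.shape[1]
    # The ridge term keeps the covariance invertible for small clusters.
    cov = np.cov(cluster, rowvar=False) + ridge * np.eye(dim)
    cov_inv = np.linalg.inv(cov)
    chol = np.linalg.cholesky(cov)  # cov == chol @ chol.T

    synthetic = np.empty((n_new, dim))
    for i in range(n_new):
        seed = cluster[rng.integers(len(cluster))]
        d = mahalanobis(seed, mean, cov_inv)
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)  # random unit direction
        # Mapping d * u through chol yields a point at Mahalanobis distance d.
        synthetic[i] = mean + d * (chol @ u)
    return synthetic


def oversample_minority(X_min, labels, n_new, seed=0):
    """Split n_new synthetic samples across minority clusters in proportion to their size."""
    rng = np.random.default_rng(seed)
    parts = []
    for label in np.unique(labels):
        cluster = X_min[labels == label]
        if len(cluster) < 2:  # covariance is undefined for a single point
            continue
        share = int(round(n_new * len(cluster) / len(X_min)))
        if share > 0:
            parts.append(synthesize_in_cluster(cluster, share, rng))
    if not parts:
        return np.empty((0, X_min.shape[1]))
    return np.vstack(parts)
```

Generating inside each cluster separately, rather than over the whole minority class, is one way to keep synthetic points from bridging distant sub-concepts (small disjuncts). Whether ClustSyn does exactly this is not stated in the abstract.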
