Abstract

Datasets that have imbalanced class distributions pose a challenge for learning and classification algorithms. Imbalanced datasets exist in many domains, such as: fraud detection, sentiment analysis, churn prediction, and intrusion detection in computer networks. To solve the imbalance problem, three main approaches are typically used: data resampling, method adaptation and cost-sensitive learning; of these, data resampling, either oversampling the minority class instances or undersampling the majority class instances, is the most used approach. However, in most cases, when implementing these approaches, there is a trade-off between the predictive performance and the complexity. In this paper we introduce a fast, novel clustering-based undersampling technique for addressing binary-class imbalance problems, which demonstrates high predictive performance, while its time complexity is bound by the size of the minority class instances. During the training phase, the algorithm clusters the minority instances and selects a similar number of majority instances from each cluster. A specific classifier is then trained for each cluster. An unlabeled instance is classified as the majority class if it does not fit into any of the clusters. Otherwise, cluster-specific classifiers are used to return the instance's classification, and the results are weighted by the inverse-distance from the clusters. Our evaluation includes several state-of-the-art methods. We plot the Pareto frontier for various datasets, to consider both computational cost and predictive performance measures. Extensive sets of experiments demonstrate that only the suggested method is always found on the frontier.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.