Abstract

Many real-world applications reveal difficulties in learning classifiers from imbalanced data. Although several methods for improving classifiers have been introduced, identifying the conditions under which a particular method can be used efficiently is still an open research problem. It is also worth studying the nature of imbalanced data, the characteristics of the minority class distribution and their influence on classification performance. However, current studies of imbalanced data difficulty factors have mainly been carried out with artificial datasets, and their conclusions are not easily applicable to real-world problems, partly because methods for identifying these factors are not sufficiently developed. In this paper, we capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. First, we confirm their occurrence in real data by exploring multidimensional visualizations of selected datasets. Then, we introduce a method for identifying these types of examples, based on analyzing the class distribution in a local neighbourhood of the considered example. Two ways of modeling this neighbourhood are presented: with k-nearest examples and with kernel functions. Experiments with artificial datasets show that these methods are able to rediscover simulated types of examples. Further contributions of this paper include a comprehensive experimental study with 26 real-world imbalanced datasets, in which (1) we identify new data characteristics based on the analysis of types of minority examples; and (2) we demonstrate that considering the results of this analysis allows us to differentiate the classification performance of popular classifiers and pre-processing methods and to evaluate their areas of competence. Finally, we highlight directions for exploiting the results of our analysis to develop new algorithms for learning classifiers and pre-processing methods.
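The k-nearest-examples variant of the neighbourhood analysis can be illustrated with a minimal sketch. The abstract does not give the neighbourhood size or the cut-offs between types, so everything concrete here is an assumption made for illustration: the function name `label_minority_examples`, the Euclidean distance, k = 5, and the ratio thresholds 0.7 and 0.3. The toy data are invented and are not from the study.

```python
import math

def label_minority_examples(X, y, minority, k=5):
    """Label each minority example by the class mix of its k nearest neighbours.

    Assumed (hypothetical) rule: the share of same-class neighbours decides the
    type: > 0.7 safe, > 0.3 borderline, > 0 rare, exactly 0 outlier.
    """
    types = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue  # only minority examples are typed
        # Rank all other examples by Euclidean distance to example i.
        ranked = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in ranked[:k]]
        same = sum(1 for j in neighbours if y[j] == minority)
        ratio = same / len(neighbours)
        if ratio > 0.7:
            types[i] = "safe"
        elif ratio > 0.3:
            types[i] = "borderline"
        elif ratio > 0.0:
            types[i] = "rare"
        else:
            types[i] = "outlier"
    return types

# Toy data: a compact minority cluster, a distant majority cluster,
# and one minority example placed inside the majority cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 0.2),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5), (12, 12),
     (10.4, 10.4)]
y = ["min"] * 6 + ["maj"] * 6 + ["min"]

types = label_minority_examples(X, y, minority="min")
print(types)
```

With k = 5 the assumed thresholds map 4 or 5 same-class neighbours to safe, 2 or 3 to borderline, 1 to rare and 0 to outlier. A kernel-based variant would replace the hard neighbour count with distance-weighted class proportions.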

Highlights

  • In many real life problems classifiers are faced with imbalanced data, which means that one of the target classes contains a much smaller number of instances than the other classes

  • Class imbalance is an obstacle for learning classifiers as they are biased toward the majority classes and tend to misclassify minority class examples

  • We present the visualisations after the Multidimensional Scaling (MDS) projection of three imbalanced datasets from the UCI repository, often used in the experimental studies concerning class imbalance: thyroid, ecoli and cleveland (Fig. 1b, c and d)


Summary

Introduction

In many real-life problems classifiers are faced with imbalanced data, which means that one of the target classes contains a much smaller number of instances than the other classes. Class imbalances have been observed in many application problems such as detection of oil spills in satellite images, analysing financial risk, predicting technical equipment failures, managing network intrusion, text categorization and information filtering; for some reviews see, e.g., (He and Garcia 2009; He and Ma 2013). In all those problems the correct recognition of the minority class is of key importance. Class imbalance is an obstacle for learning classifiers as they are biased toward the majority classes and tend to misclassify minority class examples.

