Abstract

This paper studies the classification of unbalanced data sets. It first briefly introduces this kind of data set, and then analyzes classification methods for unbalanced data sets in detail from several perspectives: data sampling, the algorithm level, the feature level, cost-sensitive functions, and deep learning. The data sampling methods are further grouped by the technique they build on, namely the synthetic minority over-sampling technique (SMOTE), support vector machines (SVM), and k-nearest neighbors (KNN), among others. The advantages and disadvantages of these methods are then compared. Finally, the evaluation criteria for classifiers on unbalanced data sets are summarized, and directions for future work are outlined.

Highlights

  • Data tend to change their characteristics over time, and because the number of learning instances is not equal across the classes considered, the resulting distribution makes the data sets difficult to classify

  • The true positive rate and false positive rate from the confusion matrix, the ROC curve, the G-mean, and similar measures are commonly used to evaluate classifiers on unbalanced data, because they better measure classifier performance given the characteristics of unbalanced data sets

  • The classification of unbalanced data sets is of great significance in data mining, because unbalanced data sets are very common in real life and the problems they cause are becoming more and more apparent
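
The evaluation measures named in the highlights can be computed directly from the four cells of a binary confusion matrix. The following is a minimal sketch, not taken from the paper: the function names `confusion_counts` and `imbalance_metrics` are hypothetical, the minority class is assumed to be the positive class, and the G-mean is taken here as the geometric mean of the true positive rate and the true negative rate.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, TPR, FPR and G-mean; undefined rates fall back to 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the positive class
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tnr = tn / (fp + tn) if (fp + tn) else 0.0  # recall on the negative class
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tpr,
        "fpr": fpr,
        "g_mean": math.sqrt(tpr * tnr),
    }


# Assumed toy data: 5 minority (positive) and 95 majority (negative) instances,
# scored against a classifier that always predicts the majority class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(imbalance_metrics(y_true, y_pred))
```

On this toy data the always-majority classifier reaches high accuracy while its true positive rate and G-mean are both zero, which illustrates why accuracy alone is a poor criterion for unbalanced data sets.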


Summary

INTRODUCTION

Data tend to change their characteristics, and because the number of learning instances is not equal across the classes considered, the resulting distribution makes the data sets difficult to classify. A defining characteristic of unbalanced data sets is that the instances of interest in data mining usually belong to the minority class, yet that class contains only a small number of instances. At the end of this chapter, the classification algorithms for unbalanced data sets that use sampling are summarized according to the type of technique used, which is clearer than in previous reviews. (1) This paper summarizes and analyzes the classification methods for unbalanced data sets in detail from the aspects of data sampling, the algorithm level, the feature level, and deep learning methods. (2) Within the sampling methods, this paper summarizes the classification methods for unbalanced data sets from three aspects: synthetic minority over-sampling technique (SMOTE), support vector machine (SVM), and k-nearest neighbor (KNN). The external methods are the classification methods based on data sampling techniques introduced in this chapter, and the internal methods, which create or modify the algorithm itself, are described in detail in the chapter
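
To make the sampling idea concrete, here is a minimal sketch of the interpolation step behind SMOTE: a synthetic minority instance is placed at a random point on the line segment between a minority instance and one of its k nearest minority neighbors. This is an illustration under stated assumptions, not the paper's implementation: the function name `smote_sketch` and its parameters are hypothetical, distances are Euclidean, and seed instances are drawn at random rather than generating a fixed number of synthetic instances per minority instance.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors.

    minority: list of equal-length numeric feature vectors (minority class only).
    k: number of nearest minority neighbors considered for each seed point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbor = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Assumed toy minority class: four points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote_sketch(minority, n_new=3, k=2):
    print(point)
```

Because every synthetic point lies between two existing minority instances, the new instances stay inside the region the minority class already occupies instead of duplicating existing instances exactly, which is what distinguishes this approach from plain random over-sampling.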

UNBALANCED DATA SETS CLASSIFICATION METHOD BASED ON SAMPLING METHOD
SAMPLING METHOD BASED ON SVM
OTHER UNBALANCED DATA SET CLASSIFICATION METHODS
CLASSIFICATION ALGORITHM OF UNBALANCED DATA SETS AT FEATURE LEVEL
EVALUATION CRITERION OF CLASSIFIER
FUTURE WORK
CONCLUSION