Abstract

The class-imbalance learning is one of the most significant research topics in the data mining and machine learning. Imbalance problem means that one of the classes has much more samples than that of other classes. To deal with the issues of low classification accuracy and high time complexity, this paper proposes an novel imbalance data classification algorithm based on clustering and SVM. The algorithm suggests under-sampling in majority samples based on the distribution characteristics of minority samples. First, specific clusters are detected by cluster analysis on the minority. Second, a cluster boundary strategy is proposed to eliminate the bad influence of noise samples. To structure a balanced dataset for imbalance data, this paper proposes three principles of under-sampling on majority samples according to the characteristic of samples in the cluster. Finally, the optimal classification model from the linear combination of hybrid-kernel SVM is obtained. The experiments based on datasets in UCI and KEEL database show that our algorithm effectively decreases the interference of noise samples. Compared with the SMOTE and Fast-CBUS, the proposed algorithm not only reduces the feature dimension, but also improves the precision of the minor classes under the different labeled sample rates generally.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.