Abstract

The imbalanced data problem is widespread in network intrusion detection, spam filtering, biomedical engineering, finance, and science, and it poses a challenge in many real-life data-intensive applications. Traditional classification algorithms become biased when they are applied to imbalanced data. The General Vector Machine (GVM) algorithm is known to generalize well, but it does not perform well on imbalanced classification. The state-of-the-art Binary Ant Lion Optimizer (BALO) algorithm offers strong exploitation ability and a fast convergence rate. Building on these facts, this paper proposes a Cost-sensitive Feature selection General Vector Machine (CFGVM) algorithm, based on GVM and BALO, that tackles the imbalanced classification problem by assigning different cost weights to different classes of samples. In our method, the BALO algorithm determines the cost weights and extracts the more significant features to improve classification performance. Experiments on eleven imbalanced data sets show that the CFGVM algorithm significantly improves classification performance on minority class samples. Compared with similar algorithms and with state-of-the-art algorithms, the proposed algorithm performs significantly better and produces better classification results.

Highlights

  • Traditional classification research rests on two basic assumptions: (1) the numbers of samples are approximately equal across classes; (2) the misclassification cost is roughly the same for every class

  • The proposed Cost-sensitive Feature selection General Vector Machine (CFGVM) algorithm is novel in combining a cost-sensitive learning method with a feature selection method, building on the strengths of the Binary Ant Lion Optimizer (BALO) and the General Vector Machine (GVM)

  • The proposed algorithm CFGVM (CFGVM3) combines a cost-sensitive learning method and a feature selection method based on BALO and GVM


Summary

Introduction

Traditional classification research rests on two basic assumptions: (1) the numbers of samples are approximately equal across classes; (2) the misclassification cost is roughly the same for every class. In practical applications, these two assumptions rarely hold. The raw data in many applications are imbalanced, and the applications themselves focus on the minority categories. In the detection of illegal credit card transactions, for example, most transactions are legitimate and only a small number are unlawful; here the imbalance follows from how rarely the event of interest occurs.
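To make the two assumptions concrete, the following is a minimal sketch of what happens when both fail at once. The data set (990 legitimate and 10 unlawful transactions), the two toy classifiers, and the cost values are all hypothetical and chosen only for illustration; they are not taken from the paper, and the code does not implement CFGVM, GVM, or BALO.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def minority_recall(y_true, y_pred, minority=1):
    """Fraction of minority-class samples that were correctly detected."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == minority)
    total = sum(1 for t in y_true if t == minority)
    return hits / total

def weighted_cost(y_true, y_pred, cost):
    """Total misclassification cost; `cost` maps a true label to its error cost."""
    return sum(cost[t] for t, p in zip(y_true, y_pred) if t != p)

# Hypothetical data: label 0 = legitimate, label 1 = unlawful transaction.
y_true = [0] * 990 + [1] * 10

# Classifier A ignores the minority class entirely.
always_legit = [0] * 1000

# Classifier B catches all 10 unlawful cases at the price of 30 false alarms.
flags_some = [1] * 30 + [0] * 960 + [1] * 10

# Assumed costs: missing an unlawful transaction is far more expensive
# than raising a false alarm on a legitimate one.
cost = {0: 1.0, 1: 99.0}

ratio = imbalance_ratio(y_true)
acc_a, acc_b = accuracy(y_true, always_legit), accuracy(y_true, flags_some)
rec_a, rec_b = minority_recall(y_true, always_legit), minority_recall(y_true, flags_some)
cost_a, cost_b = weighted_cost(y_true, always_legit, cost), weighted_cost(y_true, flags_some, cost)

print(f"imbalance ratio: {ratio}")
print(f"A: accuracy={acc_a:.2f} minority recall={rec_a:.2f} cost={cost_a}")
print(f"B: accuracy={acc_b:.2f} minority recall={rec_b:.2f} cost={cost_b}")
```

Under these assumed numbers, classifier A has the higher accuracy yet detects no unlawful transaction, while classifier B has the lower accuracy but a much lower weighted cost. This is the kind of bias that motivates assigning different cost weights to different classes.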

