Using Cost-Sensitive Learning and Feature Selection Algorithms to Improve the Performance of Imbalanced Classification

Fang Feng, Jun Shen, Xuhui Yang, Qingguo Zhou, Kuan-Ching Li

doi:10.1109/access.2020.2987364

Abstract

The imbalanced data problem is widespread in network intrusion detection, spam filtering, biomedical engineering, finance, and science, and it poses a challenge in many real-life data-intensive applications. Classifier bias occurs when traditional classification algorithms are applied to imbalanced data. The General Vector Machine (GVM) algorithm is known to have good generalization ability, but it does not work well for imbalanced classification. The state-of-the-art Binary Ant Lion Optimizer (BALO) algorithm, in turn, has high exploitability and a fast convergence rate. Based on these facts, we propose a Cost-sensitive Feature selection General Vector Machine (CFGVM) algorithm, built on GVM and BALO, that tackles the imbalanced classification problem by assigning different cost weights to different classes of samples. In our method, the BALO algorithm determines the cost weights and extracts the more significant features to improve classification performance. Experiments conducted on eleven imbalanced data sets show that the CFGVM algorithm significantly improves classification performance on minority class samples. Compared with similar algorithms and state-of-the-art algorithms, the proposed algorithm performs significantly better and produces better classification results.

Highlights

In traditional classification research, there are two basic assumptions: (1) the numbers of samples are approximately equal across classes; (2) the misclassification costs of different classes are roughly the same
The proposed Cost-sensitive Feature selection General Vector Machine (CFGVM) algorithm combines a cost-sensitive learning method and a feature selection method, building on the Binary Ant Lion Optimizer (BALO) and the General Vector Machine (GVM), which were chosen for their novelty and excellent underlying characteristics
The proposed algorithm CFGVM (CFGVM3) combines a cost-sensitive learning method and a feature selection method based on BALO and GVM
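
To make the combination of cost-sensitive learning and feature selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes a candidate solution is a binary feature mask scored by a class-weighted error. A nearest-centroid classifier stands in for GVM and an exhaustive search over masks stands in for BALO. The cost weights are fixed by hand here, whereas in the paper they are determined by BALO. All function names and the toy data are hypothetical.

```python
from itertools import product

def apply_mask(rows, mask):
    """Keep only the features whose mask bit is 1."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]

def fit_centroids(rows, labels):
    """Stand-in classifier (NOT GVM): one mean vector per class."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(rows, centroids):
    """Assign each row to the class with the nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda c: sq_dist(r, centroids[c])) for r in rows]

def weighted_error(y_true, y_pred, class_weight):
    """Share of total class weight that is misclassified."""
    total = sum(class_weight[t] for t in y_true)
    wrong = sum(class_weight[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total

def fitness(mask, rows, labels, class_weight):
    """Score a mask; evaluated on the training data for brevity."""
    masked = apply_mask(rows, mask)
    preds = predict(masked, fit_centroids(masked, labels))
    return weighted_error(labels, preds, class_weight)

def best_mask(rows, labels, class_weight):
    """Exhaustive search (a stand-in for BALO) over non-empty masks.
    Ties on error are broken in favour of fewer features."""
    n = len(rows[0])
    candidates = [m for m in product((0, 1), repeat=n) if any(m)]
    return min(candidates,
               key=lambda m: (fitness(m, rows, labels, class_weight), sum(m)))

# Hypothetical toy data: feature 0 is informative, feature 1 is noise,
# feature 2 is constant. Class 1 is the minority class.
X = [[0.0, 5.0, 1.0], [0.2, -4.0, 1.0], [0.1, 3.0, 1.0], [0.3, -5.0, 1.0],
     [0.2, 4.0, 1.0], [0.1, -3.0, 1.0], [1.0, -4.5, 1.0], [0.9, 5.5, 1.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
weights = {0: 1.0, 1: 3.0}  # assumed cost weights; minority errors cost more

chosen = best_mask(X, y, weights)
```

On this toy data the search keeps only the informative feature, because the noisy feature pushes some samples toward the wrong centroid. An exhaustive search is feasible only for a handful of features, which is why a metaheuristic such as BALO is used in practice.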

Summary

Introduction

Traditional classification research rests on two basic assumptions: (1) the numbers of samples are approximately equal across classes; (2) the misclassification costs of different classes are roughly the same. In practical applications, these two assumptions rarely hold. The distribution of raw data in many applications is imbalanced, and these applications focus on the minority categories. In the detection of illegal credit card transactions, for example, most transactions are legitimate and only a small number are unlawful; here the data imbalance stems from how rarely the event of interest occurs.
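
A small worked example shows why both assumptions matter. The sketch below uses hypothetical numbers (990 legitimate and 10 fraudulent transactions, and an assumed cost of 50 for a missed fraud versus 1 for a false alarm); none of these figures come from the paper. It compares a classifier that labels everything as legitimate with one that catches most fraud at the price of some false alarms.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) with the minority class as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn

def accuracy(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def minority_recall(y_true, y_pred):
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

def total_cost(y_true, y_pred, cost_fn=1.0, cost_fp=1.0):
    """Sum of misclassification costs; cost_fn and cost_fp are assumptions."""
    _, fn, fp, _ = confusion_counts(y_true, y_pred)
    return cost_fn * fn + cost_fp * fp

# Hypothetical data: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10

# Classifier A: always predicts "legitimate".
always_legit = [0] * 1000

# Classifier B: 30 false alarms, catches 8 of the 10 frauds.
cautious = [1] * 30 + [0] * 960 + [1] * 8 + [0] * 2

for name, pred in [("always_legit", always_legit), ("cautious", cautious)]:
    print(name,
          "accuracy=%.3f" % accuracy(y_true, pred),
          "recall=%.1f" % minority_recall(y_true, pred),
          "equal_cost=%.0f" % total_cost(y_true, pred),
          "weighted_cost=%.0f" % total_cost(y_true, pred, cost_fn=50.0))
```

Under accuracy or equal costs, the trivial classifier looks better (accuracy 0.990 versus 0.968, cost 10 versus 32) even though it detects no fraud. Once a missed fraud is assumed to cost 50 times as much as a false alarm, the ranking reverses (500 versus 130), which is the motivation for assigning different cost weights to different classes.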

Results

Discussion

Conclusion