Abstract

Class imbalance learning is an important research problem in data mining and machine learning. Most solutions, including data-level, algorithm-level, and cost-sensitive approaches, are built on multi-class classifiers and depend on the number of classes to be classified. One-class classification (OCC) techniques, in contrast, have been widely used for anomaly or outlier detection, where only normal or positive class training data are available. In this study, we treat every two-class imbalanced dataset as an anomaly detection problem: the majority class, i.e. the normal or positive class, contains a large number of examples, while the minority class contains very few. The research objectives of this paper are to understand the performance of OCC classifiers and to examine how much performance improves when feature selection is used to pre-process the majority-class training data and when ensemble learning is employed to combine multiple OCC classifiers. Based on 55 datasets covering a wide range of class imbalance ratios, with one-class support vector machine, isolation forest, and local outlier factor as the representative OCC classifiers, we found that OCC classifiers perform well on highly imbalanced datasets, outperforming the C4.5 baseline. In most cases, however, performing feature selection does not improve the performance of the OCC classifiers. Many homogeneous and heterogeneous OCC classifier ensembles do outperform single OCC classifiers, and some specific combinations of multiple OCC classifiers, with or without feature selection, perform similarly to or better than the baseline combination of SMOTE and C4.5.

Highlights

  • Many real-world domain problem datasets are class imbalanced, meaning that the numbers of data in different classes are not the same

  • While many related works focus on data-level, algorithm-level, and cost-sensitive solutions, very few consider one-class classification (OCC) techniques, which have been widely used in anomaly or outlier detection where only the normal class data are available

  • We conduct an empirical study of the performance of three representative OCC classifiers, i.e. one-class support vector machine (OCSVM), isolation forest (IFOREST), and local outlier factor (LOF), and their ensembles based on 55 different two-class datasets containing different imbalance ratios ranging from 1.82 to 129.44
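The setup described above can be sketched with scikit-learn's implementations of the three named OCC classifiers: each is trained on majority-class data only, and a heterogeneous ensemble is formed by majority vote. This is an illustrative sketch on synthetic data, not the paper's experimental protocol; all parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic imbalanced two-class data: a large "normal" majority class
# and a small, well-separated minority class (the anomalies)
X_major = rng.normal(0.0, 1.0, size=(500, 2))   # majority / normal class
X_minor = rng.normal(4.0, 1.0, size=(25, 2))    # minority / anomalies
X_test = np.vstack([X_major[:100], X_minor])
y_test = np.array([1] * 100 + [-1] * 25)        # +1 = normal, -1 = anomaly

# Train each OCC classifier on majority-class examples only
models = [
    OneClassSVM(nu=0.05, gamma="scale"),     # illustrative parameters
    IsolationForest(random_state=0),
    LocalOutlierFactor(novelty=True),        # novelty=True enables predict()
]
preds = []
for m in models:
    m.fit(X_major)                    # fit on the majority class only
    preds.append(m.predict(X_test))   # each prediction is +1 or -1

# Heterogeneous ensemble: majority vote over the three predictions
# (an odd number of +/-1 votes can never tie)
vote = np.sign(np.sum(preds, axis=0))
accuracy = float(np.mean(vote == y_test))
```

Because the minority class plays the role of the anomaly class, no minority-class examples are needed at training time, which is what makes OCC attractive at high imbalance ratios.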


Introduction

Many real-world domain problem datasets are class imbalanced, meaning that the numbers of data in different classes are not the same. Each class imbalanced dataset has at least one of the following three characteristics: small sample size, overlapping (or class separability), and small disjuncts [6], [7]. A small sample size means that there are not enough examples in the minority class, which can cause an imbalanced class distribution. Small disjuncts occur when the concept represented by the minority class is formed of sub-concepts, which are located differently in the feature space and whose numbers of instances are usually not balanced among themselves. This increases the complexity of the problem to be solved.
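The degree of imbalance is commonly summarized by the imbalance ratio, the size of the largest class divided by the size of the smallest. A minimal sketch of this computation (my own illustration; the label values are hypothetical):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical two-class labels: 90 majority vs. 10 minority examples
y = [0] * 90 + [1] * 10
print(imbalance_ratio(y))  # → 9.0
```

Under this definition, a ratio near 1 indicates a balanced dataset, while large values correspond to the highly imbalanced settings where OCC approaches are expected to help.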
