ENSEMBLE META CLASSIFIER WITH SAMPLING AND FEATURE SELECTION FOR DATA WITH IMBALANCE MULTICLASS PROBLEM

Mohd Shamrie Sainin, Rayner Alfred, Faudziah Ahmad

doi:10.32890/jict2021.20.2.1

Abstract

Ensemble learning, which combines several single classifiers or other ensemble classifiers, is one of the approaches to solving the imbalance problem in multiclass data. However, this approach still faces the question of how ensemble methods obtain their higher performance. In this paper, an investigation was carried out on the design of a meta classifier ensemble with sampling and feature selection for multiclass imbalanced data. The specific objectives were: 1) to improve the ensemble classifier through a data-level approach (sampling and feature selection); 2) to perform experiments on sampling, feature selection, and the ensemble classifier model; and 3) to evaluate the performance of the ensemble classifier. To fulfil the objectives, a preliminary data collection of Malaysian plants' leaf images was prepared and experimented on, and the results were compared. The ensemble design was also tested on three other benchmark datasets with high imbalance ratios. It was found that the design using sampling, feature selection, and an ensemble classifier method via AdaboostM1 with random forest (also an ensemble classifier) provided improved performance throughout the investigation. The result of this study is important to the on-going problem of multiclass imbalance, where a specific structure and its performance can be improved in terms of processing time and accuracy.

Highlights

With the advancement of the industrial revolution 4.0, more data are being captured, stored, processed, and analysed
Sampling with an ensemble classifier performed almost as well as the combination of all three components (sampling, feature selection, and ensemble classifier)
Multiclass imbalance is still an on-going problem in real-world data mining and machine learning, where data are greatly affected by a high imbalance ratio between classes: one or more classes have very few samples while the other classes have many
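
The imbalance ratio mentioned in the highlight above can be made concrete with a small calculation. The sketch below assumes one common convention, the size of the largest class divided by the size of the smallest class; the source does not state which definition it uses, and the function name `imbalance_ratio` and the toy labels are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed convention)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical three-class label set: 90, 8, and 2 samples per class.
labels = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
print(imbalance_ratio(labels))  # 45.0
```

A ratio of 1.0 means perfectly balanced classes; larger values indicate a more severe imbalance.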

Summary

Introduction

With the advancement of the industrial revolution 4.0, more data are being captured, stored, processed, and analysed. A multiclass classification problem refers to assigning one of several class labels to an input object. Compared with binary classification, learning a multiclass problem is a more complex task, since each example must be assigned to exactly one of more than two class labels. There are three categories of methods proposed for learning multiclass classification problems, namely 1) direct multiclass classification techniques using a single classifier; 2) binary conversion classification techniques; and 3) hierarchical classification techniques. A direct classifier is any algorithm that can be applied naturally to solve multiclass classification problems directly, such as a neural network, decision tree, k-nearest neighbour (k-NN), Naive Bayes (NB), or support vector machine (SVM) (Mehra & Gupta, 2013). If the process requires several steps to change, select, and preprocess certain data before the classification, it is called an indirect method, or identified as a hybrid approach.
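
To illustrate the indirect (hybrid) approach, the following minimal sketch chains the three components named in the abstract: sampling, feature selection, and a boosted ensemble with random forest as the base learner. It is not the authors' implementation. It assumes scikit-learn, uses `AdaBoostClassifier` as a stand-in for AdaboostM1, uses plain random oversampling and `SelectKBest` as placeholder sampling and feature selection steps, and runs on synthetic data from `make_classification`; the helper `random_oversample` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def random_oversample(X, y, seed=0):
    """Duplicate rows at random until every class matches the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, n in zip(classes, counts):
        members = np.flatnonzero(y == cls)
        extra = rng.choice(members, size=target - n, replace=True)
        keep.append(np.concatenate([members, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Synthetic three-class data with skewed class weights (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3,
    weights=[0.80, 0.15, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: sampling, applied to the training split only.
X_bal, y_bal = random_oversample(X_train, y_train)

# Steps 2 and 3: feature selection, then boosting over a random forest.
model = make_pipeline(
    SelectKBest(f_classif, k=10),
    AdaBoostClassifier(
        RandomForestClassifier(n_estimators=25, random_state=0),
        n_estimators=5, random_state=0,
    ),
)
model.fit(X_bal, y_bal)

score = balanced_accuracy_score(y_test, model.predict(X_test))
print(f"balanced accuracy: {score:.3f}")
```

Oversampling is applied only to the training split so that duplicated minority samples cannot leak into the test set, and balanced accuracy is reported because plain accuracy can look high on imbalanced data even when minority classes are misclassified.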

Objectives

Methods

Results

Conclusion