Abstract

The class imbalance problem has been a hot topic in the machine learning community in recent years, and in the era of big data and deep learning it remains in force. Much work has been devoted to dealing with class imbalance, the random sampling methods (over- and under-sampling) being the most widely employed approaches. More sophisticated sampling methods have also been developed, notably the Synthetic Minority Over-sampling Technique (SMOTE), which has in turn been combined with cleaning techniques such as Edited Nearest Neighbor or Tomek's Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, the class imbalance problem has mainly been addressed by adapting traditional techniques, while intelligent approaches have been relatively ignored. This work therefore analyzes the capabilities and possibilities of heuristic sampling methods for deep learning neural networks in the big data domain, with particular attention to cleaning strategies. The study is carried out on big, multi-class imbalanced datasets obtained from hyperspectral remote sensing images. A hybrid approach is evaluated on these datasets: the training set is first over-sampled with SMOTE and an Artificial Neural Network (ANN) is trained on those data; ENN is then applied to the network output to remove noisy samples, and the ANN is retrained on the resulting dataset. The results suggest that the best classification performance is achieved when the cleaning strategies are applied to the ANN output rather than only to the input feature space. This makes clear the need to consider the classifier's nature when classical class imbalance approaches are adapted to deep learning and big data scenarios.
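The hybrid pipeline described above reduces to four steps: over-sample with SMOTE, train the ANN, edit with ENN in the network's output space, and retrain. The following is a minimal sketch of that idea, assuming scikit-learn's MLPClassifier as a stand-in for the deep ANN and imbalanced-learn's SMOTE and EditedNearestNeighbours; it illustrates the strategy rather than reproducing the authors' implementation.

```python
# Minimal sketch of the hybrid strategy: SMOTE -> train ANN -> ENN on the
# network output -> retrain. Illustrative only, not the authors' code.
import numpy as np
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

def smote_ann_enn(X, y, hidden=(128, 64), seed=0):
    # 1) Over-sample the minority classes in the input feature space.
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X, y)

    # 2) Train the ANN on the balanced data.
    ann = MLPClassifier(hidden_layer_sizes=hidden, max_iter=300,
                        random_state=seed).fit(X_bal, y_bal)

    # 3) Apply ENN in the network's output space: edit the class-probability
    #    vectors instead of the raw features.
    probs = ann.predict_proba(X_bal)
    enn = EditedNearestNeighbours()
    enn.fit_resample(probs, y_bal)
    keep = enn.sample_indices_  # indices of the samples ENN retains

    # 4) Retrain the ANN on the cleaned training set.
    return MLPClassifier(hidden_layer_sizes=hidden, max_iter=300,
                         random_state=seed).fit(X_bal[keep], y_bal[keep])
```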

Highlights

  • The huge amount of data continuously generated by digital applications is an important challenge for the machine learning field [1]

  • The main contributions of this research are: (a) this paper focuses on multi-class imbalance problems, which have hardly been investigated [50,51,52] and are critical issues in the field of data classification [45,53]; (b) it addresses one of the most popular deep learning methods (Artificial Neural Networks, ANN), a specialized research topic, with details on particular aspects of the classifier, such as answering the question: is it pertinent to use methods that work in the feature space for ANN classifiers that set the decision boundary in the hidden space?; and (c) results are presented that show the effectiveness of applying editing methods to the neural network output in order to improve the deep neural network's classification performance

  • Consider the Salinas dataset: it is highly imbalanced, yet its g-mean value is relatively high (0.8848); i.e., it is affected by the class imbalance, but this does not induce poor classification performance (the g-mean, the geometric mean of the per-class recalls, is illustrated in the sketch below)
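The g-mean cited above is the geometric mean of the per-class recalls; a hedged illustration, using imbalanced-learn's geometric_mean_score and made-up labels, is:

```python
# Hedged illustration of the g-mean metric: the geometric mean of per-class
# recalls. The labels below are made up for demonstration only.
import numpy as np
from imblearn.metrics import geometric_mean_score
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])  # imbalanced ground truth
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 0])  # hypothetical predictions

# Library version (multi-class g-mean).
print(geometric_mean_score(y_true, y_pred, average="multiclass"))

# Equivalent by hand: geometric mean of the per-class recalls.
recalls = recall_score(y_true, y_pred, average=None)
print(np.prod(recalls) ** (1.0 / len(recalls)))
```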


Summary

Introduction

The huge amount of data continuously generated by digital applications is an important challenge for the machine learning field [1]. Machine learning and deep learning algorithms are strongly affected by the class imbalance problem [11,12,13,14,15]. The latter refers to the difficulties that appear when the number of samples in one or more classes of the dataset is much smaller than in another class (or classes), which produces an important deterioration of classifier performance [16]. One related work proposes a solution for breast cancer diagnosis: decision trees and Multi-Layer Perceptrons are used as base classifiers to build an ensemble similar to the Easy Ensemble algorithm, together with sub-sampling methods to deal with the class imbalance problem. More studies are needed to test methods that traditionally perform well in machine learning (such as random and heuristic sampling algorithms) at the big data scale. The potential of traditional sampling methods for deep learning neural networks in the big data context is studied in this work. The main contributions of this research are: (a) this paper focuses on multi-class imbalance problems, which have hardly been investigated [50,51,52] and are critical issues in the field of data classification [45,53]; (b) it addresses one of the most popular deep learning methods (Artificial Neural Networks, ANN), a specialized research topic, with details on particular aspects of the classifier, such as answering the question: is it pertinent to use methods that work in the feature space for ANN classifiers that set the decision boundary in the hidden space?; and (c) results are presented that show the effectiveness of applying editing methods to the neural network output in order to improve the deep neural network's classification performance.
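As a point of reference for the ensemble-plus-sub-sampling idea mentioned above, a hedged sketch of an EasyEnsemble-style committee with MLP members follows; all names and parameters are illustrative and do not reproduce the cited work.

```python
# EasyEnsemble-style sketch: each member is trained on a randomly under-sampled,
# balanced subset, and predictions are combined by majority vote.
# Illustrative only; assumes integer class labels.
import numpy as np
from sklearn.neural_network import MLPClassifier
from imblearn.under_sampling import RandomUnderSampler

def easy_ensemble_mlp(X, y, n_members=10, seed=0):
    members = []
    for k in range(n_members):
        rus = RandomUnderSampler(random_state=seed + k)
        X_k, y_k = rus.fit_resample(X, y)  # balanced subset
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                            random_state=seed + k).fit(X_k, y_k)
        members.append(clf)
    return members

def predict_vote(members, X):
    votes = np.stack([m.predict(X) for m in members])  # (n_members, n_samples)
    # Majority vote per sample (labels must be non-negative integers).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```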

Deep Learning Multi-Layer Perceptron
Sampling Class Imbalance Approaches
Over-Sampling Methods
Under-Sampling Methods
Hybrid Sampling Class Imbalance Strategies
Database Description
Parameter Specification for the Algorithms Used in the Experimentation
Classifier Performance and Tests of Statistical Significance
Experimental Results and Discussion
Method
Conclusions and Future Work