Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark

I Triguero, J Maillo, F Herrera, H Bustince, M Galar, D Merino

doi:10.1109/cec.2016.7743853


Open Access

https://doi.org/10.1109/cec.2016.7743853


Abstract

The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to this issue. The task becomes even more difficult when the number of majority class examples is very large. In this scenario, the evolutionary model becomes impractical because of memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing the data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a lack of density of the minority class in the subsets considered. To address this problem, in this contribution we provide a new big data scheme based on the emerging technology Apache Spark to tackle highly imbalanced datasets. We take advantage of its in-memory operations to diminish the effect of the small sample size. The key point of this proposal lies in the independent management of majority and minority class examples, which allows us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model on several datasets with up to 17 million instances. The results show the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.
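
The independent handling of the two classes can be illustrated with a minimal sketch. It is not the authors' implementation: plain Python lists stand in for Spark partitions, the function names `split_majority_keep_minority` and `random_undersample` are hypothetical, and simple random undersampling is used as a stand-in for the evolutionary search. It assumes the minority class is small enough to be copied into every subset.

```python
import random
from typing import List, Sequence, Tuple

# (features, label); by assumption label 1 is the minority class.
Example = Tuple[Sequence[float], int]


def split_majority_keep_minority(
    data: List[Example], n_subsets: int, minority_label: int = 1
) -> List[List[Example]]:
    """Split only the majority class; copy every minority example into each subset."""
    minority = [ex for ex in data if ex[1] == minority_label]
    majority = [ex for ex in data if ex[1] != minority_label]
    subsets = []
    for i in range(n_subsets):
        # Strided slice: each majority example lands in exactly one subset.
        majority_chunk = majority[i::n_subsets]
        subsets.append(majority_chunk + minority)
    return subsets


def random_undersample(
    subset: List[Example], minority_label: int = 1, seed: int = 0
) -> List[Example]:
    """Stand-in for evolutionary undersampling: keep a random, balanced majority sample."""
    rng = random.Random(seed)
    minority = [ex for ex in subset if ex[1] == minority_label]
    majority = [ex for ex in subset if ex[1] != minority_label]
    kept = rng.sample(majority, min(len(majority), len(minority)))
    return kept + minority


if __name__ == "__main__":
    # Toy data: 1000 majority examples and 10 minority examples.
    toy = [([float(i)], 0) for i in range(1000)] + [([float(i)], 1) for i in range(10)]
    for part in split_majority_keep_minority(toy, n_subsets=4):
        reduced = random_undersample(part)
        print(len(part), len(reduced))
```

With a naive split of the whole dataset, the 10 minority examples of the toy data would be spread across the 4 subsets, leaving only two or three per subset. Splitting the majority class alone keeps all 10 available wherever undersampling runs.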

Highlights

In recent years, the amount of information that can be automatically gathered has been growing inexorably in multiple fields such as bioinformatics, social media or physics
We propose a big data scheme for extremely imbalanced problems, implemented under Apache Spark, which aims at solving the lack of density problem in our previous model
The aim of this paper is to tackle both issues by designing an imbalanced big data model, which relies on the flexibility and in-memory operations of Apache Spark

Summary

Introduction

The amount of information that can be automatically gathered is growing inexorably in multiple fields such as bioinformatics, social media or physics. This research topic is referred to as big data [1]. Big data learning poses a significant challenge to the research community because standard data mining models cannot deal with the volume, diversity and complexity that these data bring [2]. The MapReduce framework [3], and its open-source implementation in Hadoop [4], were the first alternatives to. The MapReduce programming paradigm [3] is a scalable data processing tool designed by Google in 2003. It was designed to be part of the most powerful search engine on the Internet, but it rapidly became one of the most effective techniques for general-purpose data-intensive applications.



Publication Date: Jul 1, 2016
Citations: 77	License type: other-oa

Similar Papers

SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling
Hongjiao Guan ... Xianglong Tang
Applied Intelligence | VOL. 51
25 Sep 2020

Deep Learning for Imbalanced Multimedia Data Classification
Yilin Yan ... Min Chen
01 Dec 2015

Evolutionary Undersampling for Classification with Imbalanced Datasets: Proposals and Taxonomy
Salvador García ... Francisco Herrera
Evolutionary Computation | VOL. 17
01 Sep 2009

CLASSIFICATION BOOSTING IN IMBALANCED DATA
Sinta Septi Pangastuti ... Nur Iriawan
Malaysian Journal of Science | VOL. 38
30 Sep 2019
