Abstract

Designing efficient learning models capable of dealing with massive amounts of data has become a necessity in the era of big data. However, the volume of available data often exceeds what traditional data mining techniques can handle. This issue is even more serious when evolutionary algorithms are a key part of the learning algorithm. In this scenario, a typical approach is to follow a divide-and-conquer strategy, where the data is split into chunks that are addressed individually and independently. Afterwards, the partial knowledge obtained from each chunk is combined to produce a solution to the problem. Nevertheless, such local approaches never consider the data as a whole, missing a global view of the problem, which may result in less accurate models whose quality also depends on how the data is split. In this work, we focus on evolutionary feature selection algorithms. A divide-and-conquer approach for evolutionary feature selection in big data has already been developed; we aim at designing its global counterpart, which tackles the feature selection problem from a global perspective, making use of the data as a whole to select the most appropriate features. To do so, we implement our algorithm on Apache Spark, a big data technology. We design a genetic algorithm capable of dealing with big datasets by selecting suitable parameters for our base algorithm (the well-known CHC) and by adapting the evaluation procedure to take all of the distributed data into account. Several preliminary results are discussed to study the feasibility of global evolutionary feature selection methods for big datasets.
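
The abstract does not include an implementation, so the following Scala sketch only illustrates what a "global" evaluation could look like on Spark: each candidate feature subset is scored against the entire distributed dataset instead of a single chunk. The Instance type, the nearest-centroid training error used as fitness, and all identifiers are illustrative assumptions, not the paper's actual CHC-based evaluation procedure.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical instance type: a class label plus a dense feature vector.
case class Instance(label: Double, features: Array[Double])

object GlobalFitness {

  // Keep only the features whose bit is set in the chromosome.
  def project(features: Array[Double], chromosome: Array[Boolean]): Array[Double] =
    features.indices.filter(chromosome(_)).map(features(_)).toArray

  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Fitness of one chromosome, computed over the *whole* distributed dataset:
  // the training error of a nearest-centroid classifier restricted to the
  // selected features. The data RDD should be cached, since every chromosome
  // in every generation triggers a full pass over it.
  def fitness(data: RDD[Instance], chromosome: Array[Boolean]): Double = {
    // Pass 1: per-class feature sums and counts, aggregated cluster-wide.
    val centroids = data
      .map(i => (i.label, (project(i.features, chromosome), 1L)))
      .reduceByKey { (a, b) =>
        (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)
      }
      .mapValues { case (sum, n) => sum.map(_ / n) }
      .collectAsMap()

    // Pass 2: classify every instance and count the misclassified ones.
    val errors = data.filter { i =>
      val p = project(i.features, chromosome)
      val predicted = centroids.minBy { case (_, c) => squaredDistance(p, c) }._1
      predicted != i.label
    }.count()

    errors.toDouble / data.count()
  }
}
```

A wrapper fitness built around a real classifier would follow the same two-phase pattern: aggregate the statistics the model needs across all partitions, then score every instance against the resulting model.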

Highlights

  • Machine learning algorithms are expected to improve their performance as more data is considered. In practice, however, several challenges arise when learning algorithms are applied to big datasets, due to memory and time limitations [1]

  • In order to avoid these drawbacks, in this work we focus on the design and feasibility study of a global evolutionary feature selection model

  • Spark is developing increasingly efficient APIs such as DataFrames and Datasets, which are built on top of Resilient Distributed Datasets (RDDs). These new APIs leverage the Catalyst query optimizer and the Tungsten memory manager; the latter stores data outside the Java Virtual Machine (JVM), making both storage and computation more efficient (a minimal usage sketch follows this list)
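
To make the last highlight concrete, here is a minimal Scala sketch contrasting the two API levels. The file path and the "class" column name are assumptions made for illustration; only standard Spark calls are used.

```scala
import org.apache.spark.sql.SparkSession

object ApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-vs-rdd-sketch")
      .getOrCreate()

    // DataFrame API: queries go through the Catalyst optimizer, and row data
    // is kept in Tungsten's off-heap binary format rather than as JVM objects.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///path/to/dataset.csv") // hypothetical path

    df.groupBy("class").count().show() // hypothetical label column

    // The same data is still reachable as an RDD when low-level control over
    // partitions is needed (e.g., inside a custom evolutionary loop).
    val rdd = df.rdd
    println(s"partitions: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```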

Summary

INTRODUCTION

Machine learning algorithms are expected to improve their performance as more data is considered, yet they suffer from scalability issues when dealing with big datasets. For this reason, previous works have tackled this problem by developing a local evolutionary feature selection model [15]. In order to avoid the drawbacks of such local approaches, in this work we focus on the design and feasibility study of a global evolutionary feature selection (global EFS) model. This is possible thanks to technologies such as Apache Spark, which allow us to perform multiple passes over the same data without much overhead, as sketched below.
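
As a rough illustration of why Spark suits iterative algorithms such as genetic algorithms, the hypothetical Scala sketch below loads the data once, caches it in memory, and then performs repeated passes over it, as each generation of a GA would. The path and the number of generations are placeholders, not values from the paper.

```scala
import org.apache.spark.sql.SparkSession

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cached-iterations").getOrCreate()
    val sc = spark.sparkContext

    // Load once and pin in memory: later passes reuse the cached partitions
    // instead of re-reading the file from distributed storage.
    val data = sc.textFile("hdfs:///path/to/dataset.csv").cache() // hypothetical path
    data.count() // force materialization of the cache

    val generations = 10 // hypothetical number of GA generations
    for (g <- 1 to generations) {
      // Each generation triggers a full pass over the same cached data,
      // e.g., to evaluate the fitness of every individual in the population.
      val nonEmpty = data.filter(_.nonEmpty).count()
      println(s"generation $g scanned $nonEmpty records")
    }

    spark.stop()
  }
}
```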

Frameworks for Big Data processing
Feature Selection in the Big Data context
A GLOBAL EVOLUTIONARY FEATURE SELECTION WITH APACHE SPARK
EXPERIMENTAL STUDY
Experimental set-up
Preliminary results and discussion
Analysis of the behavior of global EFS
Findings
CONCLUDING REMARKS