Abstract

Nowadays, many disciplines have to deal with big datasets that additionally involve a high number of features. Feature selection methods aim to eliminate noisy, redundant, or irrelevant features that may deteriorate classification performance. However, traditional methods lack the scalability needed to cope with datasets of millions of instances and to deliver results within an acceptable time. This paper presents a feature selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets. The algorithm decomposes the original dataset into blocks of instances and learns from them in the map phase; the reduce phase then merges the partial results into a final vector of feature weights, which allows a flexible application of the feature selection procedure: a threshold on the weights determines the selected subset of features. The feature selection method is evaluated using three well-known classifiers (SVM, Logistic Regression, and Naive Bayes) implemented within the Spark framework to address big data problems. In the experiments, datasets of up to 67 million instances and up to 2000 attributes have been handled, showing that this is a suitable framework to perform evolutionary feature selection and that it improves both classification accuracy and runtime when dealing with big data problems.
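
As a rough illustration of this map/reduce scheme, the sketch below (written against the Spark RDD API in Scala) splits the data into blocks, lets each block produce a partial vector of feature weights, merges the partial vectors in a reduce step, and applies a threshold to pick the selected features. The function evolutionaryWeights, the averaging of the merged weights, and all parameter names are illustrative assumptions, not the authors' implementation.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

object EFSSketch {

  // Hypothetical per-block routine: in the paper this would be an evolutionary
  // (genetic) search that assigns a weight to every feature; here it is a stub.
  def evolutionaryWeights(block: Seq[LabeledPoint], numFeatures: Int): Array[Double] =
    Array.fill(numFeatures)(0.0) // placeholder weights

  // Map phase: each block of instances yields a partial vector of feature weights.
  // Reduce phase: the partial vectors are merged (summed, then averaged here)
  // and a threshold selects the final subset of feature indices.
  def selectFeatures(data: RDD[LabeledPoint], numFeatures: Int,
                     numBlocks: Int, threshold: Double): Array[Int] = {
    val partial = data.repartition(numBlocks).mapPartitions { instances =>
      Iterator(evolutionaryWeights(instances.toSeq, numFeatures))
    }
    val merged = partial.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    merged.map(_ / numBlocks)
          .zipWithIndex
          .collect { case (weight, index) if weight >= threshold => index }
  }
}
```

The returned indices could then be used to project the dataset before training the SVM, Logistic Regression, or Naive Bayes classifiers mentioned above.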

Highlights

  • Learning from very large databases is a major issue for most of the current data mining and machine learning algorithms [1]

  • This paper presents a feature selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets

  • The main objective of this paper is to enable Evolutionary Feature Selection (EFS) models to be applied on big data

Summary

Introduction

Learning from very large databases is a major issue for most of the current data mining and machine learning algorithms [1]. This problem is commonly referred to as “big data,” a term that denotes the difficulties and disadvantages of processing and analyzing huge amounts of data [2,3,4]. The MapReduce paradigm [8,9] and its distributed file system [10], originally introduced by Google, offer an effective and robust framework to address the analysis of big datasets. This approach is currently preferred in data mining over other parallelization schemes, such as MPI (Message Passing Interface) [11], because of its fault-tolerant mechanism and its simplicity. Feature selection reduces the dimensionality of such datasets; this reduction facilitates the understanding of the extracted patterns and speeds up subsequent learning stages.

