Abstract

With the exponential growth of the amount of data being generated, stored and processed on a daily basis in the machine learning, data analytics and decision-making systems, the data preprocessing established itself as the key factor for building reliable high-performance machine learning models. One of the roles in preprocessing is variable reduction using feature selection methods; however, the processing time needed for these methods is a major drawback. This study aims at mitigating this problem by migrating the algorithm to a MapReduce implementation suitable for parallelization on a high number of commodity hardware units. The genetic algorithm-based methods were put in the focus of this study. Hadoop, an open-source MapReduce library, was used as a framework for implementing parallel genetic algorithms within our research. The representative machine learning methods, SVM (support vector machine), ANN (artificial neural network), RT (random tree), logistic regression and Naive Bayes, were embedded into implementation for feature selection. The feature selection methods were applied to four NSL-KDD data sets, and the number of features is reduced from cca 40 to cca 10 data sets with the accuracy of 90.45%. These results have both significant practical and theoretical impact. On the one hand, the genetic algorithm has been parallelized in the MapReduce manner, which has been considered unachievable in a strict sense. Furthermore, the genetic algorithm allows randomness-enhanced feature selection and its parallelization reduces overall data preprocessing and allows larger population count which in turn leads to better feature selection. On the practical side, it has been shown that this implementation outperforms the existing feature selection methods.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call