Big data classification using heterogeneous ensemble classifiers in Apache Spark based on MapReduce paradigm

Hamidreza Kadkhodaei, Amir Masoud Eftekhari Moghadam, Mehdi Dehghan

doi:10.1016/j.eswa.2021.115369

Abstract

In the era of big data, processing large-scale data efficiently and accurately has become a challenging problem. Ensemble classification is a type of supervised learning that uses multiple experts to generate the final output, which provides a way to classify data more accurately. Because they use multiple classifiers, ensemble classifiers are often more complicated than single classifiers, especially for big data problems. Apache Spark is a unified analytics engine for big data processing that provides a scalable framework for analyzing data. In this paper, we first extend our previous work and design a distributed heterogeneous ensemble classifier, inspired by the boosting approach, that is capable of dealing with big datasets. Using heterogeneous classifiers makes it possible to have more diverse classifiers and, consequently, to obtain a more accurate classifier. Then, we present a Spark version of the proposed approach that uses the MapReduce paradigm to speed up our heterogeneous ensemble classifier. To evaluate our approach, we applied it to seven big datasets. Extensive experimental results indicate the superiority of the proposed method over the existing ensemble algorithms implemented in Spark MLlib in terms of classification accuracy, performance, and scalability.
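The general idea of a heterogeneous ensemble built in a map/reduce style can be illustrated with a minimal sketch. It is not the method proposed in the paper: the abstract does not specify the base learners, the weighting scheme, or the boosting-inspired steps, so everything below is an assumption made for illustration. The two toy learners, the accuracy-based weights, and all function names (`train_centroid`, `train_majority`, `map_partition`, `reduce_models`, `predict`) are hypothetical. Plain Python lists stand in for Spark partitions so that the example runs without a cluster.

```python
from collections import Counter, defaultdict
from functools import reduce

# Two deliberately simple, different base learners. They are hypothetical
# stand-ins for "heterogeneous classifiers"; the abstract does not name any.

def train_centroid(rows):
    """Nearest-centroid classifier on 1-D features."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in rows:
        sums[y][0] += x
        sums[y][1] += 1
    centroids = {y: s / n for y, (s, n) in sums.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def train_majority(rows):
    """Always predicts the most frequent label seen in the partition."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def map_partition(rows):
    """Map step: fit every learner type on one partition and attach a weight.

    Assumption: the weight is the model's accuracy on its own partition.
    """
    models = [train(rows) for train in (train_centroid, train_majority)]
    return [(m, accuracy(m, rows)) for m in models]

def reduce_models(a, b):
    """Reduce step: merge the per-partition lists of (model, weight)."""
    return a + b

def predict(ensemble, x):
    """Weighted vote over all collected models."""
    votes = defaultdict(float)
    for model, weight in ensemble:
        votes[model(x)] += weight
    return max(votes, key=votes.get)

# Toy data: (feature, label) pairs split into two "partitions".
partitions = [
    [(0.1, 0), (0.3, 0), (0.9, 1), (1.1, 1)],
    [(0.2, 0), (0.0, 0), (0.4, 0), (1.0, 1)],
]
ensemble = reduce(reduce_models, map(map_partition, partitions))
```

Calling `predict(ensemble, x)` then returns the label with the largest total weight. In this sketch the weaker majority-class models are outvoted by the more accurate centroid models, which is the intuition behind combining diverse learners with weights.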

Full Text