Abstract

In the era of big data, processing large-scale data efficiently and accurately has become a challenging problem. Ensemble classification is a type of supervised learning that combines the outputs of multiple classifiers (experts) to produce the final prediction, which can improve classification accuracy. Because they rely on multiple classifiers, ensembles are often more complex than single classifiers, especially for big data problems. Apache Spark is a unified analytics engine for big data processing that provides a scalable framework for data analysis. In this paper, we first extend our previous work and design a distributed heterogeneous ensemble classifier, inspired by the boosting approach, that is capable of handling big datasets. Using heterogeneous classifiers increases the diversity among ensemble members and, consequently, yields a more accurate classifier. We then present a Spark implementation of the proposed approach, which uses the MapReduce paradigm to speed up our heterogeneous ensemble classifier. To evaluate our approach, we applied it to seven big datasets. Extensive experimental results indicate that the proposed method outperforms the existing ensemble algorithms implemented in Spark MLlib in terms of classification accuracy, performance, and scalability.
