Abstract

This paper addresses the challenge of scaling machine learning algorithms to massive datasets. Big data now has significant industrial applications, driven by gains in system performance and the ability to turn raw information into knowledge. Among its core challenges are the computational cost of computing machine learning predictions over large volumes of data and the lack of strategies for managing that cost. To overcome these scalability issues, it is advantageous to use a distributed, parallelized architecture spanning multiple nodes. Our approach is based on Apache Spark, an in-memory distributed computing framework that provides extensive machine learning libraries. The main contribution of this study is a measurement of scalability: we record the execution time each classifier achieves as the workload grows. We validate our classifier models with experiments on logistic regression and random forest, studying how well each adapts to the Apache Spark framework. The work combines the areas of big data and machine learning with a focus on scalability and on Spark's cache and persist optimization methods. In addition, a comparison between the two classifiers is provided. The evaluation experiments show that logistic regression achieved the shortest execution time and the best scalability.
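
The abstract's methodology, timed training of Spark MLlib classifiers on data held in memory via cache/persist, can be sketched as follows. This is a minimal illustration, not the authors' code: the dataset path, the LIBSVM input format (with its default label/features columns), the hyperparameters, and the timing helper are all assumptions.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ClassifierScalability {
  // Simple wall-clock timer; prints elapsed seconds for a training run.
  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("classifier-scalability").getOrCreate()

    // Hypothetical input: a labeled dataset in LIBSVM format, which Spark
    // loads into "label" and "features" columns directly.
    val data = spark.read.format("libsvm").load("hdfs:///data/sample.libsvm")

    // The cache/persist optimization mentioned in the abstract: keep the
    // training data in memory so iterative algorithms avoid re-reading it.
    data.persist(StorageLevel.MEMORY_ONLY)

    val lr = new LogisticRegression().setMaxIter(100)
    val rf = new RandomForestClassifier().setNumTrees(100)

    // Execution time is the scalability metric: repeating this with growing
    // workloads (larger inputs, more nodes) allows comparing the classifiers.
    time("logistic regression")(lr.fit(data))
    time("random forest")(rf.fit(data))

    data.unpersist()
    spark.stop()
  }
}
```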
