Abstract

This paper explores the execution time of supervised and unsupervised models in the Apache Spark framework on massive datasets. Big Data analytics has become relevant in industry due to the need to convert information into knowledge. Among the challenges of big data is the design of strategies to reduce the execution cost of running machine learning models for prediction. Apache Spark is a powerful in-memory platform that offers an extensive machine learning library for regression, classification, clustering, and rule extraction. From a computational-cost perspective, this investigation performs several experiments on real datasets. The main contribution of the paper is a comparison of the execution times of different machine learning models: random forests, decision trees, logistic regression, linear support vector machines, and kNN. The present work combines the areas of big data and machine learning, comparing results under different configurations and with the optimization methods cache and persist. The evaluation experiments show that logistic regression achieved the shortest execution time among the Spark MLlib models.
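The timing methodology the abstract describes can be sketched as a simple wall-clock measurement around a training call. This is a minimal illustration, not the paper's code: `fit_model` is a hypothetical stand-in for an MLlib estimator's `fit` call, and the comments note where the optimization methods mentioned above (`cache`, `persist`) would apply in the Spark setting.

```python
import time

def time_training(fit_model, *args):
    """Measure the wall-clock time of a single model-training call.

    In the Spark setting, the training DataFrame would first be kept
    in memory, e.g. via df.cache() or df.persist(), so that repeated
    timing runs are not dominated by re-reading the input data.
    """
    start = time.perf_counter()
    model = fit_model(*args)
    elapsed = time.perf_counter() - start
    return model, elapsed

# Hypothetical stand-in for an estimator's fit() call.
def dummy_fit(data):
    return sum(data)  # trivial "model" for illustration only

model, seconds = time_training(dummy_fit, range(1000))
```

The same wrapper would be applied to each of the compared models (random forest, decision tree, logistic regression, linear SVM, kNN) to produce comparable execution-time figures.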
