Cloud-agnostic architectures for machine learning based on Apache Spark

Enikő Nagy, Róbert Lovas, István Pintye, Ákos Hajnal, Péter Kacsuk

doi:10.1016/j.advengsoft.2021.103029

Abstract

Reference architectures for Big Data, machine learning and stream processing include not only recommended practices and interconnected building blocks but also considerations for scalability, availability, manageability, and security. However, the automated deployment of multi-VM platforms on various clouds based on such reference architectures may raise several issues. The paper focuses on the widespread Apache Spark Big Data platform as the baseline and on the Occopus cloud-agnostic orchestrator tool. The new generation of reference architectures is configurable through human-readable descriptors according to the available resources and cloud providers, and offers various components such as Jupyter Notebook, RStudio, HDFS, and Kafka. These pre-configured reference architectures can be deployed automatically and on demand, even by data scientists themselves, using a multi-cloud approach that covers a wide range of cloud systems such as Amazon AWS, Microsoft Azure, OpenStack, OpenNebula, and CloudSigma. Occopus enables the scaling of cluster-oriented components (such as Spark) of the instantiated reference architectures. The presented solution was successfully used in the Hungarian Comparative Agendas Project (CAP) by the Institute for Political Science to classify newspaper articles.

Highlights

Cloud-based Big Data and Machine Learning (ML) applications [1,2] are becoming increasingly popular in industry and in the academic and education sectors
An important advantage of the proposed solution is that the main parameters of the Apache Spark architecture can be customized, the computing capacity required for processing can be scaled, and the solution is cloud-independent
We extended the reference architecture with more components, including Hadoop Distributed File System (HDFS), RStudio, Python, and Kafka

Summary

Introduction

Cloud-based Big Data and Machine Learning (ML) applications [1,2] are becoming increasingly popular in industry and in the academic and education sectors. An important advantage of the proposed solution is that the main parameters of the Apache Spark architecture (such as the size of the cluster and the number of CPU cores and the memory configuration per worker) can be customized, the computing capacity required for processing can be scaled, and the solution is cloud-independent. To fulfill these goals we used a hybrid cloud orchestration tool called Occopus [7], which was developed by SZTAKI. We performed the first set of measurements to benchmark the scalability of the Spark cluster in an instantiated reference architecture on the ELKH Cloud (see Section 7).
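As an illustration of the customizable parameters mentioned above (cluster size, CPU cores and memory per worker), the following minimal sketch maps them onto standard Spark configuration properties. It is not taken from the paper or from Occopus: the `SparkClusterParams` class and the `spark_submit_args` helper are hypothetical names, and the sketch assumes a Spark standalone cluster with one executor per worker that uses all of the worker's cores.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SparkClusterParams:
    """Hypothetical container for the tunable parameters named in the text."""
    master_host: str            # address of the Spark master (standalone mode assumed)
    workers: int                # size of the cluster (number of worker nodes)
    cores_per_worker: int       # CPU cores per worker
    memory_per_worker_gb: int   # memory per worker, in gigabytes
    master_port: int = 7077     # default port of a standalone Spark master


def spark_submit_args(params: SparkClusterParams, app: str) -> List[str]:
    """Translate the cluster parameters into a spark-submit argument list."""
    if min(params.workers, params.cores_per_worker, params.memory_per_worker_gb) < 1:
        raise ValueError("workers, cores and memory must all be positive")
    # Assumption: one executor per worker, so the application may use
    # workers * cores_per_worker cores in total.
    total_cores = params.workers * params.cores_per_worker
    return [
        "spark-submit",
        "--master", f"spark://{params.master_host}:{params.master_port}",
        "--conf", f"spark.executor.cores={params.cores_per_worker}",
        "--conf", f"spark.executor.memory={params.memory_per_worker_gb}g",
        "--conf", f"spark.cores.max={total_cores}",
        app,
    ]


if __name__ == "__main__":
    # Hypothetical address and script name, for illustration only.
    params = SparkClusterParams(
        master_host="10.0.0.5", workers=4, cores_per_worker=2, memory_per_worker_gb=8
    )
    print(" ".join(spark_submit_args(params, "classify_articles.py")))
```

Scaling the cluster up or down then amounts to changing `workers` and resubmitting; in the described solution the worker nodes themselves would be added or removed by the orchestrator rather than by this helper.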

Related work

Apache Hadoop

Apache Spark

Occopus cloud orchestrator

Spark cluster deployment and scaling by Occopus

RStudio with Spark

Jupyter Notebook with Python and Spark

Stream processing

Validation by the Hungarian Comparative Agendas Project

Experimental results with various ML libraries

Conclusions and future work

Journal: Advances in engineering software (Barking, London, England : 1992)	Publication Date: Jun 5, 2021
Citations: 6	License type: cc-by


Similar Papers

Big data and machine learning framework for clouds and its usage for text classification
István Pintye ... Róbert Lovas
Concurrency and computation : practice & experience | VOL. 33
21 Dec 2020

Survey of Deep and Extreme Learning Machines for Big data Classification
M Deepa ... M Raja Lakshmi
Asian journal of research in social sciences and humanities | VOL. 6
01 Jan 2015

The State of Big Data Reference Architectures: A Systematic Literature Review
Pouya Ataei ... Alan Litchfield
IEEE access : practical innovations, open solutions | VOL. 10
01 Jan 2021

Efficient Training of Transfer Mapping in Physics-Infused Machine Learning Models of UAV Acoustic Field
Rayhaan Iqbal ... Amir Behjat
03 Jan 2022
