Abstract

Reference architectures for Big Data, machine learning, and stream processing include not only recommended practices and interconnected building blocks but also considerations for scalability, availability, manageability, and security. However, the automated deployment of multi-VM platforms on various clouds based on such reference architectures may raise several issues. The paper focuses on the widespread Apache Spark Big Data platform as the baseline and on the Occopus cloud-agnostic orchestrator tool. The new generation of reference architectures is configurable through human-readable descriptors according to the available resources and cloud providers, and offers various components such as Jupyter Notebook, RStudio, HDFS, and Kafka. These pre-configured reference architectures can be deployed automatically and on demand, even by the data scientist, using a multi-cloud approach for a wide range of cloud systems such as Amazon AWS, Microsoft Azure, OpenStack, OpenNebula, and CloudSigma. Occopus enables the scaling of cluster-oriented components (such as Spark) of the instantiated reference architectures. The presented solution was successfully used in the Hungarian Comparative Agendas Project (CAP) by the Institute for Political Science to classify newspaper articles.

Highlights

  • Cloud-based Big Data and Machine Learning (ML) applications [1,2] are becoming increasingly popular in industry and in the academic and education sectors

  • An important advantage of the proposed solution is that the main parameters of the Apache Spark architecture can be customized, and the computing capacity required for processing can be scaled in a cloud-independent way

  • We extended the reference architecture with more components, including Hadoop Distributed File System (HDFS), RStudio, Python, and Kafka

Summary

Introduction

Cloud-based Big Data and Machine Learning (ML) applications [1,2] are becoming increasingly popular in industry and in the academic and education sectors. An important advantage of the proposed solution is that the main parameters of the Apache Spark architecture (such as the size of the cluster and the number of CPU cores and memory configuration per worker) can be customized, and the computing capacity required for processing can be scaled in a cloud-independent way. To fulfill these goals, we used a hybrid cloud orchestration tool called Occopus [7], which was developed by SZTAKI. We performed the first set of measurements for benchmarking the scalability features of the Spark cluster in an instantiated reference architecture on the ELKH Cloud (see Section 7).

Related work
Apache Hadoop
Apache Spark
Occopus cloud orchestrator
Spark cluster deployment and scaling by Occopus
RStudio with Spark
Jupyter Notebook with Python and Spark
Stream processing
Validation by the Hungarian Comparative Agendas Project
Experimental results with various ML libraries
Conclusions and future work
