Abstract
Reference architectures for Big Data, machine learning, and stream processing include not only recommended practices and interconnected building blocks but also considerations for scalability, availability, manageability, and security. However, the automated deployment of multi-VM platforms on various clouds based on such reference architectures may raise several issues. The paper focuses on the widespread Apache Spark Big Data platform as the baseline and on the cloud-agnostic orchestrator tool Occopus. The new generation of reference architectures is configurable through human-readable descriptors according to the available resources and cloud providers, and offers various components such as Jupyter Notebook, RStudio, HDFS, and Kafka. These pre-configured reference architectures can be deployed automatically and on demand, even by data scientists themselves, using a multi-cloud approach that covers a wide range of cloud systems such as Amazon AWS, Microsoft Azure, OpenStack, OpenNebula, and CloudSigma. Occopus enables the scaling of the cluster-oriented components (such as Spark) of the instantiated reference architectures. The presented solution was successfully used in the Hungarian Comparative Agendas Project (CAP) by the Institute for Political Science to classify newspaper articles.
Highlights
Cloud-based Big Data and Machine Learning (ML) applications [1,2] are becoming increasingly popular in industry as well as in the academic and education sectors
An important advantage of the proposed solution is that the main parameters of the Apache Spark architecture can be customized, the computing capacity required for processing can be scaled, and the solution is cloud-independent
We extended the reference architecture with more components, including Hadoop Distributed File System (HDFS), RStudio, Python, and Kafka
Summary
Cloud-based Big Data and Machine Learning (ML) applications [1,2] are becoming increasingly popular in industry as well as in the academic and education sectors. An important advantage of the proposed solution is that the main parameters of the Apache Spark architecture (such as the size of the cluster and the number of CPU cores and the memory configuration per worker) can be customized, the computing capacity required for processing can be scaled, and the solution is cloud-independent. To fulfill these goals we used a hybrid cloud orchestration tool called Occopus [7], which was developed by SZTAKI. We performed the first set of measurements for benchmarking the scalability features of the Spark cluster in an instantiated reference architecture on the ELKH Cloud (see Section 7).
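To make the customizable parameters more concrete, the following minimal sketch maps a cluster description (number of workers, CPU cores and memory per worker) to standard Spark configuration properties. It is an illustration only and not the Occopus descriptor format: the `ClusterSpec` class, the `spark_properties` and `scale` helpers, the one-executor-per-worker layout, and the 1 GiB of memory reserved per worker are all assumptions made for this example. Only the property names `spark.executor.cores`, `spark.executor.memory`, and `spark.cores.max` come from Spark itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical cluster description (not an Occopus descriptor)."""
    workers: int               # number of worker VMs
    cores_per_worker: int      # CPU cores on each worker
    memory_gb_per_worker: int  # RAM on each worker, in GiB


def spark_properties(spec: ClusterSpec, reserved_gb: int = 1) -> dict[str, str]:
    """Translate a cluster description into Spark configuration properties.

    Assumes one executor per worker and leaves `reserved_gb` GiB per worker
    for the operating system and the worker daemon (an assumed default).
    """
    if spec.workers < 1 or spec.cores_per_worker < 1:
        raise ValueError("cluster needs at least one worker with one core")
    executor_gb = spec.memory_gb_per_worker - reserved_gb
    if executor_gb < 1:
        raise ValueError("not enough memory left for the executor")
    return {
        "spark.executor.cores": str(spec.cores_per_worker),
        "spark.executor.memory": f"{executor_gb}g",
        # Upper bound on the total cores an application may use.
        "spark.cores.max": str(spec.workers * spec.cores_per_worker),
    }


def scale(spec: ClusterSpec, workers: int) -> ClusterSpec:
    """Return a copy of the description with a different worker count."""
    return replace(spec, workers=workers)


if __name__ == "__main__":
    small = ClusterSpec(workers=4, cores_per_worker=2, memory_gb_per_worker=8)
    for key, value in spark_properties(scale(small, 6)).items():
        print(f"{key}={value}")
```

Keeping the description separate from the derived properties means that scaling the cluster only changes the worker count, while the per-worker settings stay the same.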