Apache Spark usage and deployment models for scientific computing

Diogo Castro, Piotr Mrowczynski, Enric Tejedor, Prasanth Kothuri, Danilo Piparo

doi:10.1051/epjconf/201921407020

Open Access

https://doi.org/10.1051/epjconf/201921407020

Abstract

This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis with big data technologies. Among many frameworks, Apache Spark is currently getting the most traction from various user communities, and new ways to deploy Spark, such as Apache Mesos or Spark on Kubernetes, have started to evolve rapidly. Meanwhile, notebook web applications such as Jupyter offer the ability to perform interactive data analytics and visualizations without the need to install additional software. CERN already provides a web platform, called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of notebooks, seamlessly accessing the data and software they need. The first part of the presentation covers several recent integrations and optimizations of the Apache Spark computing platform that enable HEP data processing and CERN accelerator logging system analytics. These include, but are not limited to, access to kerberized resources, an xrootd connector enabling remote access to EOS storage, and integration with SWAN for interactive data analysis, thus forming a truly Unified Analytics Platform. The second part of the talk touches upon the evolution of the Apache Spark data analytics platform, particularly the recent work done to run Spark on Kubernetes on the virtualized and container-based infrastructure in OpenStack. This deployment model allows elastic scaling of data analytics workloads, enabling efficient, on-demand utilization of resources in private or public clouds.
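As a rough illustration of the Spark on Kubernetes deployment model with elastic scaling, the following minimal sketch assembles the Spark properties such a job might use and renders them as `spark-submit` arguments. It is not the configuration used at CERN: the API server URL, container image, application path and executor limits are hypothetical placeholders, the helper names `k8s_spark_conf` and `spark_submit_args` are invented for this example, and the shuffle-tracking property assumes a recent Spark release (3.0 or later) rather than the version available when the paper was written.

```python
from typing import Dict, List


def k8s_spark_conf(api_server: str, image: str, namespace: str = "default",
                   min_executors: int = 0, max_executors: int = 10) -> Dict[str, str]:
    """Build Spark properties for an elastically scaled job on Kubernetes.

    All argument values are supplied by the caller; nothing here reflects
    an actual CERN setup.
    """
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        # The k8s:// prefix tells Spark to schedule executors as Kubernetes pods.
        "spark.master": f"k8s://{api_server}",
        "spark.submit.deployMode": "cluster",
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        # Dynamic allocation lets the executor count follow the workload.
        "spark.dynamicAllocation.enabled": "true",
        # Assumption: Spark 3.0+, where shuffle tracking replaces the
        # external shuffle service on Kubernetes.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def spark_submit_args(conf: Dict[str, str], application: str) -> List[str]:
    """Render the properties as an argument list for the spark-submit CLI."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(application)
    return args


if __name__ == "__main__":
    # Hypothetical endpoint, image and application path, for illustration only.
    conf = k8s_spark_conf("https://k8s.example.org:6443",
                          "example/spark-py:latest", max_executors=4)
    print(" ".join(spark_submit_args(conf, "local:///opt/app/job.py")))
```

Keeping the properties in a plain dictionary means the same settings could also be passed to a `SparkSession` builder from a notebook, which is closer to the interactive SWAN workflow the abstract describes.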

Highlights

The Large Hadron Collider (LHC) is in an era of excellent performance, delivering collisions at an ever-increasing rate, which increases the amount of information recorded by the LHC experiments.
In this paper we present the recent changes and innovations in the data analysis infrastructure built around Apache Spark.
The Hadoop [6] and Spark service provided by CERN IT is used by the IT Monitoring service, which is critical for CC operations and WLCG; by IT Security for intrusion detection; by the LHC experiments (CMS, ATLAS) for analytics on computing data; and, more recently, by the CERN Beams department, which is developing the next generation of the CERN accelerator logging platform.

Summary

Introduction

The Large Hadron Collider (LHC) is in an era of excellent performance, delivering collisions at an ever-increasing rate, which increases the amount of information recorded by the LHC experiments. The burgeoning size of the datasets is leading the High Energy Physics (HEP) community to modernize its analysis infrastructure with new approaches developed in industry. One such distributed data analytics engine, which is gaining wide adoption across the CERN [1] accelerator sector, physics researchers and IT infrastructure, is Apache Spark [2]. Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and to scale up to very large-scale data processing.

Integration of SWAN with Spark Clusters

Spark Connector

Spark Monitor

HDFS Browser

Authentication and Encryption

Apache Spark deployment models

Decoupling Compute and Storage for Big Data

Provisioning of Spark on Kubernetes cluster

Spark Kubernetes Operator – Managing the lifecycle of Spark Applications

Evaluation of Spark on Kubernetes

Conclusions and Future work

Journal: EPJ Web of Conferences
Publication Date: Jan 1, 2019
Citations: 4
License type: CC BY 4.0


