Abstract

Machine Learning (ML) has been growing in popularity in multiple areas and groups at CERN, covering fast simulation, tracking, anomaly detection, among many others. We describe a new service available at CERN, based on Kubeflow and managing the full ML lifecycle: data preparation and interactive analysis, large scale distributed model training and model serving. We cover specific features available for hyper-parameter tuning and model metadata management, as well as infrastructure details to integrate accelerators and external resources. We also present results and a cost evaluation from scaling out a popular ML use case using public cloud resources, achieving close to linear scaling when using a large number of GPUs.

Highlights

  • In recent years there have been multiple efforts to apply machine learning (ML) techniques to solve different tasks and problems in High Energy Physics (HEP), covering fast alternatives to Monte Carlo based simulation [1], anomaly detection algorithms applied to the search of rare new physics [2], and fast inference models in 40 MHz scouting [3]

  • We present a new service recently made available to the CERN community that aims to improve the overall machine learning workflow

  • Machine learning usage has been growing significantly in the HEP community, and is one of the most promising areas where progress can help tackle the upcoming challenges for future experiments

Summary

Introduction

In recent years there have been multiple efforts to apply machine learning (ML) techniques to solve different tasks and problems in High Energy Physics (HEP), covering fast alternatives to Monte Carlo based simulation [1], anomaly detection algorithms applied to the search for rare new physics [2], and fast inference models in 40 MHz scouting [3]. The level of investment in this area has led to a spread of available libraries and platforms, popular ones including TensorFlow, PyTorch, and scikit-learn. Even though these have well-established communities backing them, the overall process in which they are put to use is much less well defined. We present a new service recently made available to the CERN community that aims to improve the overall machine learning workflow. It is based on Kubeflow, a machine learning toolkit offering components to handle each of the required steps: data loading and pre-processing; efficient distributed model training, storage, and versioning; and model serving. A repository with a set of examples, covering notebooks, pipelines, distributed training, model storage, and model serving, is provided at https://gitlab.cern.ch/ai-ml/examples
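As an illustration of how distributed training is expressed on such a Kubeflow-based service, a training job can be described declaratively as a Kubernetes custom resource. The following is a minimal sketch of a TFJob manifest requesting one GPU per worker; the name, namespace, image, and replica counts are placeholders chosen for this example, not values taken from the paper:

```yaml
# Hypothetical TFJob manifest: one chief plus two GPU workers.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-training        # placeholder job name
  namespace: ml-examples        # placeholder namespace
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.ch/train:latest  # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.ch/train:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU per worker replica
```

The training operator sets the `TF_CONFIG` environment variable in each replica, so TensorFlow distribution strategies can discover the cluster topology without extra wiring; scaling out is then a matter of raising the worker replica count.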

Service Objectives
Infrastructure and Kubeflow
Python and Jupyter Notebooks
Pipelines
Automated Hyper Parameter Optimization
Storage Integration
Model Serving
Sample Use Case
Public Cloud
Results and Cost Analysis
Conclusions and Future
