Abstract

Machine Learning (ML) has been growing in popularity in multiple areas and groups at CERN, covering fast simulation, tracking, anomaly detection, among many others. We describe a new service available at CERN, based on Kubeflow and managing the full ML lifecycle: data preparation and interactive analysis, large scale distributed model training and model serving. We cover specific features available for hyper-parameter tuning and model metadata management, as well as infrastructure details to integrate accelerators and external resources. We also present results and a cost evaluation from scaling out a popular ML use case using public cloud resources, achieving close to linear scaling when using a large number of GPUs.

Highlights

  • In recent years there have been multiple efforts to apply machine learning (ML) techniques to solve different tasks and problems in High Energy Physics (HEP), covering fast alternatives to Monte Carlo based simulation [1], anomaly detection algorithms applied to the search of rare new physics [2], and fast inference models in 40 MHz scouting [3]

  • We present a new service recently made available to the CERN community that aims to improve the overall machine learning workflow

  • Machine learning usage has been growing significantly in the HEP community, and is one of the most promising areas where progress can help tackle the upcoming challenges for future experiments

Summary

Introduction

In recent years there have been multiple efforts to apply machine learning (ML) techniques to solve different tasks and problems in High Energy Physics (HEP), covering fast alternatives to Monte Carlo based simulation [1], anomaly detection algorithms applied to the search for rare new physics [2], and fast inference models in 40 MHz scouting [3]. The level of investment in this area has led to a spread of available libraries and platforms, popular ones including TensorFlow, PyTorch, and scikit-learn. Even though these have well-established communities backing them, the overall process in which they are put to use is much less well defined. We present a new service recently made available to the CERN community that aims to improve the overall machine learning workflow. It is based on Kubeflow, a machine learning toolkit offering components to handle each of the required steps: data loading and pre-processing; efficient distributed model training, storage, and versioning; and model serving. A repository with a set of examples, covering notebooks, pipelines, distributed training, model storage, and model serving, is provided at https://gitlab.cern.ch/ai-ml/examples
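As an illustration of how distributed training is expressed on such a Kubeflow-based service, a training job can be described declaratively as a Kubernetes custom resource. The following is a minimal sketch of a TFJob manifest requesting one GPU per worker; the name, namespace, image, and replica counts are placeholders chosen for this example, not values taken from the paper:

```yaml
# Hypothetical TFJob manifest: one chief plus two GPU workers.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-training        # placeholder job name
  namespace: ml-examples        # placeholder namespace
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.ch/train:latest  # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.ch/train:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU per worker replica
```

The training operator sets the `TF_CONFIG` environment variable in each replica, so TensorFlow distribution strategies can discover the cluster topology without extra wiring; scaling out is then a matter of raising the worker replica count.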

Service Objectives
Infrastructure and Kubeflow
Python and Jupyter Notebooks
Pipelines
Automated Hyper Parameter Optimization
Storage Integration
Model Serving
Sample Use Case
Public Cloud
Results and Cost Analysis
Conclusions and Future
