Abstract

With the increasing number of Machine Learning and Deep Learning applications in High Energy Physics, easy access to dedicated infrastructure has become a requirement for fast and efficient R&D. This work explores different types of cloud services to train a Generative Adversarial Network (GAN) in a parallel environment, using the TensorFlow data parallel strategy. More specifically, we parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPUs) and compare two algorithms: the TensorFlow built-in logic and a custom loop, optimised for finer control over the elements assigned to each GPU worker or TPU core. The quality of the generated data is compared to Monte Carlo simulation. Linear speed-up of the training process is obtained, while retaining most of the performance in terms of physics results. Additionally, we benchmark the aforementioned approaches, at scale, over multiple GPU nodes, deploying the training process on different public cloud providers and evaluating overall efficiency and cost-effectiveness. The combination of data science, cloud deployment options and the associated economics allows workloads to burst out heterogeneously, exploiting the full potential of cloud-based services.
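The data-parallel training described above rests on a simple principle: the global batch is split across workers (GPUs or TPU cores), each worker computes gradients on its shard, and the averaged gradients drive one synchronized weight update. The sketch below illustrates that principle in plain NumPy; the linear model, loss, and shapes are illustrative assumptions, not the 3DGAN architecture or the actual TensorFlow distribution API.

```python
import numpy as np

def grad_linear_mse(w, x, y):
    """Gradient of mean squared error for a linear model y_hat = x @ w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(x)

def data_parallel_step(w, x_global, y_global, n_workers, lr=0.1):
    """One synchronized update: shard the global batch across workers,
    compute per-worker gradients, average them, and apply a single update."""
    x_shards = np.array_split(x_global, n_workers)
    y_shards = np.array_split(y_global, n_workers)
    grads = [grad_linear_mse(w, xs, ys)
             for xs, ys in zip(x_shards, y_shards)]
    return w - lr * np.mean(grads, axis=0)

# Toy regression problem (hypothetical data, fixed seed for reproducibility).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w
w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, x, y, n_workers=4)
```

With equal shard sizes, the average of the per-shard gradients equals the full-batch gradient, so the parallel update reproduces single-worker training exactly; this equivalence is what makes synchronous data parallelism attractive for scaling the adversarial training without changing its physics results.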

Highlights

  • Several studies have demonstrated the benefits of using Deep Learning (DL) to solve different tasks related to data processing in High Energy Physics (HEP): for example, generative models are being tested as fast alternatives to Monte Carlo based simulation and anomaly detection algorithms are being explored to design searches for rare new-physics processes

  • With Deep Learning models in the HEP domain becoming more complex, computational requirements increase, triggering the need to consider HPCaaS and MLaaS offerings in the public cloud

  • This work presents results of the first deployment of a three-dimensional convolutional Generative Adversarial Network (GAN) for detector simulation on Tensor Processing Units (TPUs), demonstrating an efficient parallelization of the adversarial training process: the 3DGAN training time is brought down from about a week to around one hour. This result enables large architecture hyper-parameter scans to run in just a few days, and it greatly extends the range of detector geometries that can be simulated by the 3DGAN model


Summary

Introduction

Several studies have demonstrated the benefits of using Deep Learning (DL) to solve different tasks related to data processing in High Energy Physics (HEP): for example, generative models are being tested as fast alternatives to Monte Carlo based simulation, and anomaly detection algorithms are being explored to design searches for rare new-physics processes. For this reason, in addition to optimising the specific physics use case, we provide feedback on the most efficient models and develop an end-to-end integrated strategy that includes cost prediction and optimisation when consuming public cloud resources at scale for scientific research. Our deployments are among the first examples of an innovative hybrid platform successfully running scientific deep learning workloads on public clouds at large scale [7].

Generative Adversarial Networks in High Energy Physics
The adversarial training process acceleration
Training on Google Tensor Processing Units
Results
Automated deployment using public cloud services
Google Cloud Platform with Kubeflow
Experiments and Results
Cost Analysis Overview
Microsoft Azure with Azure’s Machine Learning Service
Conclusions and Future Plans