Abstract

With the increasing number of Machine Learning and Deep Learning applications in High Energy Physics, easy access to dedicated infrastructure has become a requirement for fast and efficient R&D. This work explores different types of cloud services to train a Generative Adversarial Network (GAN) in a parallel environment, using the TensorFlow data parallel strategy. More specifically, we parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPUs) and compare two algorithms: the TensorFlow built-in logic and a custom loop, optimised for finer control over the elements assigned to each GPU worker or TPU core. The quality of the generated data is compared to Monte Carlo simulation. Linear speed-up of the training process is obtained, while retaining most of the performance in terms of physics results. Additionally, we benchmark the aforementioned approaches, at scale, over multiple GPU nodes, deploying the training process on different public cloud providers and evaluating overall efficiency and cost-effectiveness. The combination of data science, cloud deployment options and the associated economics allows workloads to burst out heterogeneously, exploiting the full potential of cloud-based services.
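The data-parallel training described above rests on a simple principle: the global batch is split across workers (GPUs or TPU cores), each worker computes gradients on its shard, and the averaged gradients drive one synchronized weight update. The sketch below illustrates that principle in plain NumPy; the linear model, loss, and shapes are illustrative assumptions, not the 3DGAN architecture or the actual TensorFlow distribution API.

```python
import numpy as np

def grad_linear_mse(w, x, y):
    """Gradient of mean squared error for a linear model y_hat = x @ w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(x)

def data_parallel_step(w, x_global, y_global, n_workers, lr=0.1):
    """One synchronized update: shard the global batch across workers,
    compute per-worker gradients, average them, and apply a single update."""
    x_shards = np.array_split(x_global, n_workers)
    y_shards = np.array_split(y_global, n_workers)
    grads = [grad_linear_mse(w, xs, ys)
             for xs, ys in zip(x_shards, y_shards)]
    return w - lr * np.mean(grads, axis=0)

# Toy regression problem (hypothetical data, fixed seed for reproducibility).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w
w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, x, y, n_workers=4)
```

With equal shard sizes, the average of the per-shard gradients equals the full-batch gradient, so the parallel update reproduces single-worker training exactly; this equivalence is what makes synchronous data parallelism attractive for scaling the adversarial training without changing its physics results.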

Highlights

  • Several studies have demonstrated the benefits of using Deep Learning (DL) to solve different tasks related to data processing in High Energy Physics (HEP): for example, generative models are being tested as fast alternatives to Monte Carlo based simulation and anomaly detection algorithms are being explored to design searches for rare new-physics processes

  • With Deep Learning models in the HEP domain becoming more complex, computational requirements increase, triggering the need to consider HPCaaS and MLaaS offerings in the public cloud

  • This work presents results of the first deployment of a three-dimensional convolutional Generative Adversarial Network (GAN) for detector simulation on Tensor Processing Units (TPUs), demonstrating an efficient parallelization of the adversarial training process: the 3DGAN training time is brought down from about a week to around one hour. This result enables large architecture hyper-parameter scans to run in just a few days, and it greatly extends the range of detector geometries that can be simulated by the 3DGAN model


Summary

Introduction

Several studies have demonstrated the benefits of using Deep Learning (DL) to solve different tasks related to data processing in High Energy Physics (HEP): for example, generative models are being tested as fast alternatives to Monte Carlo based simulation, and anomaly detection algorithms are being explored to design searches for rare new-physics processes. For this reason, in addition to optimising the specific physics use case, we provide feedback on the most efficient models and develop an end-to-end integrated strategy that includes cost prediction and optimisation when consuming public cloud resources at scale for scientific research. Our deployments are among the first examples of an innovative hybrid platform successfully running scientific deep learning workloads on public clouds at large scale [7].

Generative Adversarial Networks in High Energy Physics
The adversarial training process acceleration
Training on Google Tensor Processing Units
Results
Automated deployment using public cloud services
Google Cloud Platform with Kubeflow
Experiments and Results
Cost Analysis Overview
Microsoft Azure with Azure’s Machine Learning Service
Conclusions and Future Plans