Abstract

In recent years, proficiency in data science and machine learning (ML) has become one of the most requested skills for jobs in both industry and academia. Machine learning algorithms typically require large data sets to train models and make extensive use of computing resources, both for training and inference. For deep learning algorithms in particular, training performance can be improved dramatically by exploiting Graphics Processing Units (GPUs). The skill set needed by a data scientist is therefore extremely broad, ranging from knowledge of ML models to distributed programming on heterogeneous resources. While most available training resources focus on ML algorithms and tools such as TensorFlow, we designed a course for doctoral students in which model training is tightly coupled with the underlying technologies used to dynamically provision resources. Throughout the course, students have access to a dedicated cluster of computing nodes on local premises. A set of libraries and helper functions is provided to execute a parallelized ML task by automatically deploying a Spark driver and several Spark execution nodes as Docker containers. Task scheduling is managed by an orchestration layer (Kubernetes). This solution automates the delivery of the software stack required by a typical ML workflow and enables scalability by allowing ML tasks, including training, to be executed on commodity (CPU) or high-performance (GPU) resources distributed over different hosts across a network. The adaptation of the same model to OCCAM, the HPC facility at the University of Turin, is currently under development.
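The abstract describes helper functions that deploy a Spark driver and executors as containers scheduled by Kubernetes. A minimal configuration sketch of what such a setup can look like with standard PySpark/Spark-on-Kubernetes configuration keys is shown below; the API-server address, namespace, container image, and executor count are placeholders, not the course's actual values, and running it requires a live Kubernetes cluster:

```python
from pyspark.sql import SparkSession

# Sketch of a Spark session that asks Kubernetes to schedule the executors.
# All concrete values below are illustrative placeholders; the course's
# helper library would fill these in for the local cluster.
spark = (
    SparkSession.builder
    .appName("ml-training-example")
    .master("k8s://https://kubernetes.example.org:6443")        # K8s API server
    .config("spark.executor.instances", "4")                    # 4 executor pods
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "example/spark-py:latest")
    .getOrCreate()
)
```

With a session like this, each executor runs as a Docker container (pod) on whichever cluster node Kubernetes selects, which is what allows the same workflow to target CPU or GPU nodes transparently.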

Highlights

  • Data science is one of the fastest growing fields of information technology, with wide applications in key sectors such as research, industry, and public administration

  • Task scheduling is managed by an orchestration layer (Kubernetes). This solution automates the delivery of the software stack required by a typical machine learning (ML) workflow and enables scalability by allowing ML tasks, including training, to be executed on commodity (CPU) or high-performance (Graphics Processing Unit, GPU) resources distributed over different hosts across a network

  • At the University of Turin we introduced, for the academic year 2018-2019, a new course for doctoral students, entitled “Big Data Science and Machine Learning”, to bridge the gap between the development and optimisation of ML models and the exploitation of a distributed computing infrastructure

Summary

Introduction

Data science is one of the fastest growing fields of information technology, with wide applications in key sectors such as research, industry, and public administration. The volume of data produced by business, science, humans and machines alike has grown exponentially over the past decade, and it is expected to continue along this trend in the near future. The training and optimisation of machine learning (ML) and deep learning (DL) models may require substantial computing power, often exceeding that of a single machine. At the University of Turin we introduced, for the academic year 2018-2019, a new course for doctoral students, entitled “Big Data Science and Machine Learning”, to bridge the gap between the development and optimisation of ML models and the exploitation of a distributed computing infrastructure. We will discuss the course goals and program, the hands-on sessions, the computing infrastructure used throughout the course, and future developments.
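As a concrete illustration of the data-parallel pattern that frameworks such as Spark implement when training exceeds a single machine, the following self-contained Python sketch averages partial gradients computed on separate data shards. Plain Python lists and a loop stand in for executors here; this is a conceptual illustration, not the course's helper library:

```python
# Data-parallel training sketch: each "worker" computes a partial gradient
# on its own data shard, and the driver averages them before updating the
# model. On a real cluster the list comprehension below would run on the
# executors; here it runs sequentially for illustration.

def partial_gradient(shard, w):
    # Gradient of the mean squared error for a 1-D linear model y = w * x
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, w=0.0, lr=0.01, epochs=100):
    for _ in range(epochs):
        grads = [partial_gradient(s, w) for s in shards]  # one per worker
        w -= lr * sum(grads) / len(grads)                 # driver averages
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]  # y = 3x, so w should approach 3
shards = [data[:4], data[4:]]               # two shards = two workers
print(round(train(shards), 2))              # → 3.0
```

Only the per-shard gradients travel between workers and driver, not the data itself, which is why this pattern scales to data sets that do not fit on one machine.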
