Abstract

In recent years, proficiency in data science and machine learning (ML) has become one of the most requested skills for jobs in both industry and academia. Machine learning algorithms typically require large data sets to train models and make extensive use of computing resources, both for training and inference. For deep learning algorithms in particular, training performance can be improved dramatically by exploiting Graphics Processing Units (GPUs). The skill set needed by a data scientist is therefore extremely broad, ranging from knowledge of ML models to distributed programming on heterogeneous resources. While most available training resources focus on ML algorithms and tools such as TensorFlow, we designed a course for doctoral students in which model training is tightly coupled with the underlying technologies used to dynamically provision resources. Throughout the course, students have access to a dedicated cluster of computing nodes on local premises. A set of libraries and helper functions is provided to execute a parallelized ML task by automatically deploying a Spark driver and several Spark execution nodes as Docker containers. Task scheduling is managed by an orchestration layer (Kubernetes). This solution automates the delivery of the software stack required by a typical ML workflow and enables scalability by allowing ML tasks, including training, to be executed on commodity (CPU) or high-performance (GPU) resources distributed over different hosts across a network. The adaptation of the same model to OCCAM, the HPC facility at the University of Turin, is currently under development.
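The abstract describes helper functions that deploy a Spark driver and executors as containers scheduled by Kubernetes. A minimal configuration sketch of what such a setup can look like with standard PySpark/Spark-on-Kubernetes configuration keys is shown below; the API-server address, namespace, container image, and executor count are placeholders, not the course's actual values, and running it requires a live Kubernetes cluster:

```python
from pyspark.sql import SparkSession

# Sketch of a Spark session that asks Kubernetes to schedule the executors.
# All concrete values below are illustrative placeholders; the course's
# helper library would fill these in for the local cluster.
spark = (
    SparkSession.builder
    .appName("ml-training-example")
    .master("k8s://https://kubernetes.example.org:6443")        # K8s API server
    .config("spark.executor.instances", "4")                    # 4 executor pods
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "example/spark-py:latest")
    .getOrCreate()
)
```

With a session like this, each executor runs as a Docker container (pod) on whichever cluster node Kubernetes selects, which is what allows the same workflow to target CPU or GPU nodes transparently.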

Highlights

  • Data science is one of the fastest growing fields of information technology, with wide applications in key sectors such as research, industry, and public administration

  • Task scheduling is managed by an orchestration layer (Kubernetes). This solution automates the delivery of the software stack required by a typical machine learning (ML) workflow and enables scalability by allowing ML tasks, including training, to be executed on commodity (CPU) or high-performance (Graphics Processing Unit, GPU) resources distributed over different hosts across a network

  • At the University of Turin we introduced, for the academic year 2018-2019, a new course for doctoral students, entitled “Big Data Science and Machine Learning”, to bridge the gap between the development and optimisation of ML models and the exploitation of a distributed computing infrastructure

Summary

Introduction

Data science is one of the fastest growing fields of information technology, with wide applications in key sectors such as research, industry, and public administration. The volume of data produced by business, science, humans and machines alike has grown exponentially over the past decade, and it is expected to continue along this trend in the near future. The training and optimisation of machine learning (ML) and deep learning (DL) models may require substantial computing power, often exceeding that of a single machine. At the University of Turin we introduced, for the academic year 2018-2019, a new course for doctoral students, entitled “Big Data Science and Machine Learning”, to bridge the gap between the development and optimisation of ML models and the exploitation of a distributed computing infrastructure. We will discuss the course goals and program, the hands-on sessions, the computing infrastructure used throughout the course, and future developments.
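As a concrete illustration of the data-parallel pattern that frameworks such as Spark implement when training exceeds a single machine, the following self-contained Python sketch averages partial gradients computed on separate data shards. Plain Python lists and a loop stand in for executors here; this is a conceptual illustration, not the course's helper library:

```python
# Data-parallel training sketch: each "worker" computes a partial gradient
# on its own data shard, and the driver averages them before updating the
# model. On a real cluster the list comprehension below would run on the
# executors; here it runs sequentially for illustration.

def partial_gradient(shard, w):
    # Gradient of the mean squared error for a 1-D linear model y = w * x
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, w=0.0, lr=0.01, epochs=100):
    for _ in range(epochs):
        grads = [partial_gradient(s, w) for s in shards]  # one per worker
        w -= lr * sum(grads) / len(grads)                 # driver averages
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]  # y = 3x, so w should approach 3
shards = [data[:4], data[4:]]               # two shards = two workers
print(round(train(shards), 2))              # → 3.0
```

Only the per-shard gradients travel between workers and driver, not the data itself, which is why this pattern scales to data sets that do not fit on one machine.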
