Abstract

The increasing volume of physics data poses a critical challenge to the ATLAS experiment. In anticipation of high-luminosity physics, automation of everyday data management tasks has become necessary. Previously, many of these tasks required human decision-making and operation. Recent advances in hardware and software have made it possible to entrust more complicated duties to automated systems using models trained by machine learning algorithms. In this contribution we show results from one of our ongoing automation efforts that focuses on network metrics. First, we describe our machine learning framework built atop the ATLAS Analytics Platform. This framework can automatically extract and aggregate data, train models with various machine learning algorithms, and eventually score the resulting models and parameters. Second, we use these models to forecast metrics relevant for network-aware job scheduling and data brokering. We show the characteristics of the data and evaluate the forecasting accuracy of our models.
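As an illustration of the extract-train-score loop described above, the sketch below forecasts link throughput one hour ahead from its own lagged values and scores the model with a time-ordered split. The synthetic data, the column names, and the use of scikit-learn are assumptions made purely for illustration; the production framework on the ATLAS Analytics Platform is not reproduced here.

```python
# Minimal sketch of a train-and-score loop for one network metric.
# Column names (timestamp, throughput_mbps) and model choice are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for aggregated link metrics pulled from the analytics store:
# hourly mean throughput for a single source-destination pair.
rng = np.random.default_rng(0)
hours = pd.date_range("2018-01-01", periods=24 * 60, freq="h")
throughput = 800 + 200 * np.sin(2 * np.pi * hours.hour / 24) + rng.normal(0, 50, len(hours))
df = pd.DataFrame({"timestamp": hours, "throughput_mbps": throughput})

# Lagged values of the series serve as features for a one-hour-ahead forecast.
for lag in (1, 2, 3, 24):
    df[f"lag_{lag}"] = df["throughput_mbps"].shift(lag)
df = df.dropna()

X = df[[c for c in df.columns if c.startswith("lag_")]].to_numpy()
y = df["throughput_mbps"].to_numpy()

# Score with a rolling, time-ordered split: plain cross-validation would
# leak future information into the training folds.
model = GradientBoostingRegressor(random_state=0)
maes = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X[train_idx], y[train_idx])
    maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
print(f"mean absolute error across folds: {np.mean(maes):.1f} Mb/s")
```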

Highlights

  • The data taken by the ATLAS Experiment [1] at the Large Hadron Collider (LHC) are stored and distributed via the Worldwide LHC Computing Grid (WLCG), a network of computing centres and users across the globe

  • Due to the heterogeneous nature of the WLCG, the transfer model depends on a complex amalgamation of different systems, each with different queuing, execution, fail-over, and retrial strategies [2]

  • One of the potential solutions is to place experiment data depending on their potential usage instead of just using the fixed distribution percentages from their associated computational workflows. This should lead to fewer transfers and higher overall system throughput. Such new strategies require the use of infrastructure reliability and performance metrics in the data placement algorithm in a dynamic feedback cycle; a simplified scoring sketch follows this list
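The sketch below illustrates the feedback cycle mentioned in the last highlight: candidate sites are ranked by a score that combines recent transfer reliability with the forecast throughput of the link. The site names, the weight, and the score formula are invented for illustration and are not the production brokering algorithm.

```python
# Hypothetical data-placement scoring driven by reliability and forecast throughput.
from dataclasses import dataclass

@dataclass
class LinkMetrics:
    success_rate: float    # fraction of recent transfers that succeeded
    predicted_mbps: float  # forecast link throughput from the model

def placement_score(m: LinkMetrics, w_reliability: float = 0.6) -> float:
    """Weighted mix of reliability and (normalised) predicted throughput."""
    throughput_norm = min(m.predicted_mbps / 1000.0, 1.0)  # cap at 1 Gb/s
    return w_reliability * m.success_rate + (1 - w_reliability) * throughput_norm

# Illustrative candidate sites; in practice the metrics would be refreshed
# continuously, closing the feedback loop between monitoring and placement.
candidates = {
    "SITE_A": LinkMetrics(success_rate=0.98, predicted_mbps=450.0),
    "SITE_B": LinkMetrics(success_rate=0.80, predicted_mbps=900.0),
}
best = max(candidates, key=lambda site: placement_score(candidates[site]))
print("place dataset replica at:", best)
```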


Summary

Introduction

The data taken by the ATLAS Experiment [1] at the Large Hadron Collider (LHC) are stored and distributed via the Worldwide LHC Computing Grid (WLCG), a network of computing centres and users across the globe. Due to the heterogeneous nature of the WLCG, the transfer model depends on a complex amalgamation of different systems, each with different queuing, execution, fail-over, and retrial strategies [2]. This can lead to wildly varying data transfer times and eventually to infrastructure under-utilisation and user dissatisfaction. One of the potential solutions is to place experiment data depending on their potential usage instead of just using the fixed distribution percentages from their associated computational workflows. This should lead to fewer transfers and higher overall system throughput. Such new strategies require the use of infrastructure reliability and performance metrics in the data placement algorithm in a dynamic feedback cycle. There is currently no sensible model available to estimate the duration of a particular collection of file transfers between data centres.
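To make the missing estimate concrete, a naive baseline would divide the total volume by the recently observed link throughput and add a fixed per-file queuing overhead. The overhead constant and throughput value below are illustrative assumptions; the forecasting models discussed in this work aim to replace such static figures with learned, per-link predictions.

```python
# Naive baseline for the duration of a collection of file transfers over one link.
def estimate_transfer_duration(file_sizes_gb, link_throughput_mbps, per_file_overhead_s=30.0):
    """Rough duration (seconds): volume / throughput plus per-file queuing overhead."""
    total_bits = sum(file_sizes_gb) * 8e9
    transfer_s = total_bits / (link_throughput_mbps * 1e6)
    return transfer_s + per_file_overhead_s * len(file_sizes_gb)

# Example: 200 files of 4 GB each over a link currently sustaining 500 Mb/s.
print(f"{estimate_transfer_duration([4.0] * 200, 500.0) / 3600:.1f} hours")
```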
