Abstract

This paper describes a popularity prediction tool for data-intensive data management systems, such as ATLAS distributed data management (DDM). It is fed by the DDM popularity system, which produces historical reports about ATLAS data usage, providing information about files, datasets, users and sites where data was accessed. The tool described in this contribution uses this historical information to make a prediction about the future popularity of data. It finds trends in the usage of data using a set of neural networks and a set of input parameters, and predicts the number of accesses in the near-term future. This information can then be used in a second step to improve the distribution of replicas at sites, weighing the cost of creating new replicas (bandwidth and load on the storage system) against the gain of having them (faster access to data for analysis). To evaluate the benefit of the redistribution, a grid simulator is introduced that is able to replay real workload on different data distributions. This article describes the popularity prediction method and the simulator that is used to evaluate the redistribution.
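The two-step pipeline the abstract describes, predicting near-term accesses from historical usage and then weighing replication cost against gain, can be sketched minimally. Note that the paper uses neural networks for the prediction step; the toy below substitutes a simple exponentially weighted average purely for illustration, and `replication_cost` and `gain_per_access` are hypothetical placeholders, not quantities from the paper.

```python
def predict_next_accesses(history, alpha=0.6):
    """Forecast next-period accesses as an exponentially weighted
    average of past access counts, so recent periods dominate.
    (Stand-in for the paper's neural-network predictor.)"""
    forecast = float(history[0])
    for count in history[1:]:
        forecast = alpha * count + (1.0 - alpha) * forecast
    return forecast

def worth_replicating(history, replication_cost, gain_per_access):
    """Create a new replica only if the expected gain from faster
    access outweighs the cost of producing the copy."""
    expected_accesses = predict_next_accesses(history)
    return expected_accesses * gain_per_access > replication_cost

# A dataset whose weekly accesses are growing is a replication candidate.
growing = [5, 12, 30, 80]
print(predict_next_accesses(growing))   # forecast weighted toward recent weeks
print(worth_replicating(growing, replication_cost=50.0, gain_per_access=1.0))
```

The decaying weights mean a burst of recent accesses raises the forecast quickly, while an old, no-longer-used dataset decays toward zero and is never selected for an extra replica.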

Highlights

  • The ATLAS[1] collaboration creates and manages vast amounts of data

  • Don Quijote 2 (DQ2) organizes, transfers and manages the detector’s RAW data, and the entire life cycle of derived data products for the collaboration’s physicists. This is done in accordance with the policies established in the ATLAS Computing Model

  • In this article we describe a new, dynamic way of pro-actively replicating data, which is based on predictions about the future access of datasets by analysing the popularity of datasets in the past

Introduction

The ATLAS[1] collaboration creates and manages vast amounts of data. Since the detector started data taking, Don Quijote 2 (DQ2)[2], the collaboration’s distributed data management system, has been responsible for managing petabytes of experiment data on over 750 storage end points in the Worldwide LHC Computing Grid[3]. To run analysis jobs on this data, users send their jobs to the ATLAS workload management system, PanDA, which schedules them to run on a grid site.
