Abstract

The ATLAS Experiment at the LHC generates petabytes of data that are distributed among 160 computing sites all over the world and processed continuously by various central production and user analysis tasks. The popularity of data is typically measured as the number of accesses and plays an important role in resolving data management issues: deleting, replicating, and moving data between tapes, disks and caches. These data management procedures have so far been carried out in a semi-manual mode, and we have now focused our efforts on automating them, making use of the historical knowledge about existing data management strategies. In this study we describe sources of information about data popularity and demonstrate their consistency. Based on the calculated popularity measurements, various distributions were obtained. Auxiliary information about replication and task processing allowed us to evaluate the correspondence between the number of tasks with popular data executed per site and the number of replicas per site. We also examine the popularity of user analysis data, which is much less predictable than that of central production data and requires more indicators than just the number of accesses.

Highlights

  • A dataset in the ATLAS experiment [1] at the LHC is the aggregation of multiple files in one logical and operational unit in a distributed computing environment

  • EOS is one of the storage systems used in the Worldwide LHC Computing Grid (WLCG)

  • PanDA DB is a database system serving PanDA. It registers comprehensive historical and operational meta-information about all physics analysis tasks and jobs executed within the distributed computing environment of the ATLAS experiment


Summary

Introduction

A dataset in the ATLAS experiment [1] at the LHC is the aggregation of multiple files in one logical and operational unit in a distributed computing environment. A possible reason that they are not used is insufficient integration of the following sources of information about ATLAS data: DDM (Distributed Data Management) Rucio [4], Rucio Traces [5], EOS [6] Report Logs, and WMS (Workload Management System) PanDA [7]. These sources have specific sets of metrics that can be used to assess the popularity of datasets. We demonstrate how popularity can be evaluated based on different sources of information and which auxiliary metrics can be calculated for further integration work.
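As a minimal illustration of popularity measured as the number of accesses, the sketch below counts accesses per dataset from a list of access records. The record layout, dataset names and site names are hypothetical placeholders; they do not reflect the actual schema of Rucio Traces, EOS Report Logs or the PanDA database.

```python
from collections import Counter

# Hypothetical access records as (dataset_name, site) pairs.
# Names and layout are illustrative assumptions, not a real ATLAS schema.
access_records = [
    ("dataset_A", "SITE_1"),
    ("dataset_A", "SITE_2"),
    ("dataset_B", "SITE_1"),
    ("dataset_A", "SITE_1"),
]


def popularity_by_dataset(records):
    """Return a Counter mapping each dataset to its number of accesses."""
    return Counter(dataset for dataset, _site in records)


popularity = popularity_by_dataset(access_records)

# List datasets from most to least accessed.
for dataset, n_accesses in popularity.most_common():
    print(dataset, n_accesses)
```

In practice the same count could be computed separately for each information source and the results compared, which is the kind of consistency check the study describes.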

Sources for Data Popularity Measurements
DDM Rucio
EOS Report Logs
PanDA Database
Consistency Check of Data Popularity Metrics
Analysis of the ATLAS EOS instance at CERN Data Center using EOS Report Logs
Rucio Access Metrics for Detector and Monte-Carlo Data
Popularity of ATLAS User Analysis Data
Findings
Conclusion
