Methods of Data Popularity Evaluation in the ATLAS Experiment at the LHC

Thomas Beermann, Olga Chuchuk, Andrea Sciaba, Eugeny Tretyakov, Maria Grigorieva, Alexei Klimentov, Mario Lassnig, Alessandro Di Girolamo, Markus Schulz

doi:10.1051/epjconf/202125102013

Abstract

The ATLAS Experiment at the LHC generates petabytes of data that are distributed among 160 computing sites all over the world and processed continuously by various central production and user analysis tasks. The popularity of data is typically measured as the number of accesses, and it plays an important role in resolving data management issues: deleting data, replicating it, and moving it between tapes, disks and caches. These data management procedures have so far been carried out in a semi-manual mode; we have now focused our efforts on automating them, making use of the historical knowledge about existing data management strategies. In this study we describe the sources of information about data popularity and demonstrate their consistency. Based on the calculated popularity measurements, various distributions were obtained. Auxiliary information about replication and task processing allowed us to evaluate the correspondence between the number of tasks with popular data executed per site and the number of replicas per site. We also examine the popularity of user analysis data, which is much less predictable than that of central production data and requires more indicators than just the number of accesses.

Highlights

A dataset in the ATLAS experiment [1] at the LHC is the aggregation of multiple files in one logical and operational unit in a distributed computing environment.
EOS is one of the storage systems used in the Worldwide LHC Computing Grid (WLCG).
PanDA DB is the database system serving PanDA. It registers comprehensive historical and operational meta-information about all physics analysis tasks and jobs executed within the distributed computing environment of the ATLAS experiment.

Summary

Introduction

A dataset in the ATLAS experiment [1] at the LHC is the aggregation of multiple files in one logical and operational unit in a distributed computing environment. A possible reason that they are not used is the insufficient integration of the following sources of information about ATLAS data: DDM (Distributed Data Management) Rucio [4], Rucio Traces [5], EOS [6] Report Logs, and WMS (Workload Management System) PanDA [7]. Each of these sources has a specific set of metrics that can be used to assess the popularity of datasets. We demonstrate how popularity can be evaluated based on different sources of information and which auxiliary metrics can be calculated for further integration work.
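As a rough illustration of popularity measured as the number of accesses, the sketch below counts accesses per dataset in two sources and compares the results. It is not the authors' implementation: the record layout, the field name "dataset", the sample values and the function names count_accesses and compare_sources are all assumptions made for this example, and real Rucio traces, EOS report logs or PanDA records have their own schemas.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_accesses(records: Iterable[Mapping[str, str]]) -> Counter:
    """Popularity as number of accesses: one record counts as one access."""
    # "dataset" is a hypothetical field name, not a real Rucio/EOS/PanDA column.
    return Counter(record["dataset"] for record in records)


def compare_sources(first: Counter, second: Counter) -> Dict[str, int]:
    """Per-dataset difference in access counts between two sources.

    A value of 0 means both sources agree for that dataset.
    """
    names = sorted(set(first) | set(second))
    return {name: first.get(name, 0) - second.get(name, 0) for name in names}


# Hypothetical access records from two independent sources.
source_a = [
    {"dataset": "data.A", "site": "SITE_1"},
    {"dataset": "data.A", "site": "SITE_2"},
    {"dataset": "mc.B", "site": "SITE_1"},
]
source_b = [
    {"dataset": "data.A"},
    {"dataset": "mc.B"},
    {"dataset": "mc.B"},
]

popularity_a = count_accesses(source_a)
popularity_b = count_accesses(source_b)
difference = compare_sources(popularity_a, popularity_b)

for name, delta in difference.items():
    print(f"{name}: source A minus source B = {delta}")
```

A non-zero difference flags a dataset for which the two sources disagree, which is the kind of discrepancy a consistency check between popularity sources would need to explain.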

Sources for Data Popularity Measurements

DDM Rucio

EOS Report Logs

PanDA Database

Consistency Check of Data Popularity Metrics

Analysis of the ATLAS EOS instance at CERN Data Center using EOS Report Logs

Rucio Access Metrics for Detector and Monte-Carlo Data

Popularity of ATLAS User Analysis Data

Findings

Conclusion

Journal: EPJ Web of Conferences
Publication Date: Jan 1, 2021
Citations: 6
License type: CC BY 4.0
