Abstract

The High Luminosity phase of the LHC, which aims for a tenfold increase in the luminosity of proton-proton collisions, is expected to start operation in eight years. An unprecedented scientific data volume at the multi-exabyte scale will be delivered to particle physics experiments at CERN. This amount of data has to be stored, and the corresponding technology must ensure fast and reliable data delivery for processing by the scientific community all over the world. The present LHC computing model will not be able to provide the required infrastructure growth, even taking into account the expected hardware evolution. To address this challenge, the Data Lake R&D project was launched by the DOMA community in the fall of 2019. State-of-the-art data handling technologies are under active development, and their current status for the Russian Scientific Data Lake prototype is presented here.

Highlights

  • Modern high energy and nuclear physics experiments, which are being carried out at the accelerator complexes of the LHC (CERN, Switzerland) [1], SuperKEKB (KEK, Japan), RHIC (BNL, USA), and in the coming years at NICA (JINR, Russia) and FAIR (GSI, Germany), deal with hundreds of petabytes of scientific data located in billions of files

  • Starting from 2014, the LHC experiments have analyzed more than two exabytes of physics data annually, processing millions of files per day, and the total permanent storage capacity for the experiments has exceeded the exabyte level

  • For testing, a set of distinct files is taken in which no single file is repeated more than 20 times (see the sketch after this list)
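
The following is a minimal Python sketch of the access pattern named in the last highlight: a request sequence over distinct files in which no file occurs more than 20 times, replayed against a toy LRU cache. The file names, dataset size, sequence length, and cache capacity are illustrative assumptions, not parameters of the project's actual test harness.

    import random
    from collections import OrderedDict

    MAX_REPEATS = 20  # upper bound on repetitions of any one file
    files = [f"file_{i:04d}.root" for i in range(500)]  # hypothetical dataset

    # Draw a request sequence, capping per-file popularity at MAX_REPEATS.
    counts = {f: 0 for f in files}
    sequence = []
    while len(sequence) < 5000:
        f = random.choice(files)
        if counts[f] < MAX_REPEATS:
            counts[f] += 1
            sequence.append(f)

    # Replay against a toy LRU cache to estimate the achievable hit rate.
    cache, capacity, hits = OrderedDict(), 100, 0
    for f in sequence:
        if f in cache:
            hits += 1
            cache.move_to_end(f)  # mark as most recently used
        else:
            cache[f] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used

    print(f"hit rate: {hits / len(sequence):.2%}")

Bounding per-file repetitions keeps any single hot file from dominating the sequence, so the measured hit rate reflects cache behavior under a realistically mixed workload rather than one pathological popular file.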



Introduction

Modern high energy and nuclear physics experiments, which are being carried out at the accelerator complexes of the LHC (CERN, Switzerland) [1], SuperKEKB (KEK, Japan), and RHIC (BNL, USA), and in the coming years at NICA (JINR, Russia) and FAIR (GSI, Germany), deal with hundreds of petabytes of scientific data located in billions of files. Such enormous data volumes and numbers of files require new techniques for their storage and processing [2][3], and one of the proposed approaches is to use the “data lake” concept [4][5]. The architecture development and prototyping of “scientific data lakes” is being carried out as part of the DOMA [6] project (WLCG [7], CERN). The goal is the development of data management systems and data streams for processing and analyzing information in the exabyte range. This work is carried out by Russian research centers (NRC KI - PNPI, ISP RAS) and universities (SPbSU, MEPhI, RUE) in cooperation with international scientific centers (CERN, JINR, LAPP, UNAB).
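
To make the “data lake” idea concrete, the following is a minimal, hypothetical Python sketch of its core mechanism: a single logical namespace over several distributed storage endpoints, with replica selection ordered by an endpoint cost (for example, a cache layer first, then disk, then archival storage). The endpoint names, URLs, catalogue entries, and cost values are illustrative assumptions and do not describe the prototype's actual components.

    from dataclasses import dataclass

    @dataclass
    class Endpoint:
        name: str
        base_url: str
        cost: int  # lower = preferred (e.g. a cache, or a nearby site)

    # Hypothetical storage endpoints making up the lake.
    ENDPOINTS = {
        "cache-site":   Endpoint("cache-site",   "root://cache.lake.example",   1),
        "disk-site":    Endpoint("disk-site",    "root://disk.lake.example",    5),
        "tape-archive": Endpoint("tape-archive", "root://archive.lake.example", 50),
    }

    # Toy catalogue: logical file name -> endpoints holding a replica.
    CATALOGUE = {
        "/lake/data18/AOD.0001.root": ["disk-site", "tape-archive"],
        "/lake/data18/AOD.0002.root": ["cache-site", "disk-site"],
    }

    def resolve(lfn: str) -> list[str]:
        """Return physical URLs for a logical file, cheapest endpoint first."""
        replicas = CATALOGUE.get(lfn, [])
        ordered = sorted((ENDPOINTS[e] for e in replicas), key=lambda e: e.cost)
        return [f"{e.base_url}{lfn}" for e in ordered]

    print(resolve("/lake/data18/AOD.0002.root"))

The design point the sketch illustrates is that users and jobs address data by logical name only; where the bytes physically live, and which replica is served, is decided by the lake's catalogue and placement policy rather than by the client.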

Problem description and challenges
The prototype architecture
Data caching
Data buffering
The prototype implementation
Testing methodology
The monitoring system
Results and Conclusions
