Abstract

Thanks to its RDataFrame interface, ROOT now supports executing the same physics analysis code both on a single machine and on a cluster of distributed resources. In the latter scenario, it is common to read the input ROOT datasets over the network from remote storage systems, which often increases the time it takes for physicists to obtain their results. Storing the remote files much closer to where the computations run can bring latency and execution time down. Such a solution can be improved further by caching only the portion of the dataset that is actually processed on each machine in the cluster and reusing it in subsequent executions on the same input data. This paper shows the benefits of applying different means of caching input data in a distributed ROOT RDataFrame analysis. Two such mechanisms are applied to this kind of workflow in different configurations: caching on the same nodes that process the data, or caching on a separate server.
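
For concreteness, below is a minimal sketch of how such an analysis is commonly expressed with the Dask backend of distributed RDataFrame (available since ROOT 6.24). The scheduler address, dataset URL, tree name and branch name are placeholders rather than values from the paper.

```python
import ROOT
from dask.distributed import Client

# Connect to an existing Dask cluster (placeholder scheduler address)
client = Client("tcp://dask-scheduler.example.org:8786")

# Distributed flavour of RDataFrame: same programming model as the local one
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

# Input ROOT dataset read over the network via the XRootD protocol
df = RDataFrame("Events",
                ["root://remote-storage.example.org//data/sample.root"],
                daskclient=client)

# The analysis code is identical to what would run on a single machine
h = df.Histo1D(("h_pt", "p_{T};p_{T} [GeV];Events", 100, 0, 200), "pt")
h.Draw()  # triggers the distributed event loop and merges partial results
```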

Highlights

  • The high amount of data collected by the LHC experiments has made distributed computing a staple in High Energy Physics (HEP) data processing workflows for a long time, with the WLCG [1] being the prime example of efforts in that direction

  • Each scenario in turn presents three tests: the baseline test with caching disabled, one test with the XRootD cache enabled on a server separate from the computing nodes, and one test with the TFilePrefetch cache enabled on the local filesystem of the computing nodes (see the configuration sketches after this list)

  • The XRootD framework is quite well established in the community and its proxy plugin system may be used to cache remote files closer to the computing nodes
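
To make the two caching setups more concrete, the following is an illustrative XRootD proxy-cache configuration of the kind that could run on a caching server separate from the computing nodes. The origin host, cache directory and tuning values are placeholder assumptions, not taken from the paper.

```
# Load the proxy storage system and the disk caching plugin
ofs.osslib   libXrdPss.so
pss.cachelib libXrdPfc.so

# Origin storage to forward cache misses to (placeholder host)
pss.origin   remote-storage.example.org:1094

# Local directory where cached file blocks are stored (placeholder path)
oss.localroot /var/cache/xcache

# Basic cache tuning (illustrative values)
pfc.ram       4g
pfc.blocksize 1M

# Export everything through the proxy
all.export /
```

The TFilePrefetch-based cache, by contrast, is enabled on the client side of the computing nodes. A minimal Python sketch follows, where the cache directory and file URL are again placeholders.

```python
import ROOT

# Enable asynchronous prefetching of remote file blocks (TFilePrefetch)
ROOT.gEnv.SetValue("TFile.AsyncPrefetching", 1)

# Cache the prefetched blocks on the local filesystem of the computing node
ROOT.TFile.SetCacheFileDir("/tmp/xrd-local-cache")

# Subsequent remote reads go through the local block cache
f = ROOT.TFile.Open("root://remote-storage.example.org//data/sample.root")
```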

Summary

Introduction

The high amount of data collected by the LHC experiments has made distributed computing a staple in High Energy Physics (HEP) data processing workflows for a long time, with the WLCG [1] being the prime example of efforts in that direction. It will be crucial to make the most out of current and future architectures; in this regard, distributed computing will need to be revisited with new approaches, algorithms and frameworks. Letting users interactively explore their datasets even as they grow ever larger will be a requirement in many physics analysis groups. Services such as SWAN [4] address that need by providing a modern interactive interface for analysis through Jupyter notebooks, together with the possibility to run on distributed cluster resources on demand.

