Abstract

Apache Spark[1] is one of the predominant frameworks in the big data space, providing a fully functional query processing engine, vendor support for hardware accelerators, and performant integrations with scientific computing libraries. One difficulty in adapting conventional big data frameworks to HEP workflows is that these frameworks lack support for the ROOT file format. Laurelin[6] implements ROOT I/O as a pure Java library, with no bindings to the C++ ROOT[2] implementation, and is readily installable via standard Java packaging tools. It provides a performant interface that lets Spark read (and soon write) ROOT TTrees, so users can process these data without a pre-processing phase that converts them to an intermediate format.

Highlights

  • High Energy Physics (HEP) experiments, like those performed at the Large Hadron Collider (LHC), generate an enormous amount of data which must be captured, stored, refined, and later analyzed by thousands of users.

  • Reconstruction passes are performed over the initial data to ‘connect the dots’ of detector hits to determine the most probable types and trajectories of particles passing through the detector. This process results in datasets that are well suited to probing the physical processes that occurred at the interaction point.

  • When a user requests a new DataFrame backed by ROOT files, Spark will greedily instantiate a ‘Table’ class to represent the dataset backed by those files.
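
The last highlight describes an eager step: the ‘Table’ object exists as soon as the DataFrame is requested, before any query runs. The following is a minimal Python sketch of that pattern only, not Laurelin's implementation. The names `RootTable`, `schema_reader`, `scan`, and `fake_schema_reader` are hypothetical, and the choice to resolve the schema from the first file alone is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RootTable:
    """Hypothetical stand-in for the 'Table' created when a DataFrame is requested."""

    paths: List[str]
    # Hypothetical callback that reads only branch names and types from one file.
    schema_reader: Callable[[str], Dict[str, str]]
    schema: Dict[str, str] = field(init=False)

    def __post_init__(self) -> None:
        # Eager step: the schema is resolved at construction time.
        # Assumption: the first file is representative of the whole dataset.
        self.schema = self.schema_reader(self.paths[0])

    def scan(self, columns: List[str]) -> Dict[str, str]:
        # Deferred step: a later query names only the columns it needs.
        missing = [c for c in columns if c not in self.schema]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return {c: self.schema[c] for c in columns}


opened: List[str] = []


def fake_schema_reader(path: str) -> Dict[str, str]:
    # Stand-in for parsing TTree metadata; records which files were touched.
    opened.append(path)
    return {"Muon_pt": "float[]", "Muon_eta": "float[]", "run": "int"}


# Constructing the table triggers the schema read immediately.
table = RootTable(["a.root", "b.root"], fake_schema_reader)
# A query that needs one column sees only that column's entry.
projected = table.scan(["Muon_pt"])
```

In this sketch, constructing the table touches one file's metadata, and the column data would be read only when a scan is planned. How the real Table divides eager and deferred work may differ.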

Summary

Introduction

High Energy Physics (HEP) experiments, like those performed at the Large Hadron Collider (LHC), generate an enormous amount of data which must be captured, stored, refined, and later analyzed by thousands of users. These data are produced from the collisions of billions of subatomic particles at speeds approaching the speed of light, as well as from sophisticated simulations of the various expected physical processes. Reconstruction passes are performed over the initial data to ‘connect the dots’ of detector hits to determine the most probable types and trajectories of particles passing through the detector. This process results in datasets that are well suited to probing the physical processes that occurred at the interaction point. Because many tools like Spark were developed outside of HEP, there is little native support for the ROOT file format in the Big Data ecosystem. In HEP, this is the exception rather than the rule because most analyses will focus on a small fraction of the particles from any given Event.

Architecture
ROOT file format
Apache Spark DataSource API
Components
Interpretation
Spark DataSource
Conclusions and Next Steps