Abstract

With the new generation of High Energy Physics (HEP) experiments, huge amounts of data are being generated. Efficient parallel algorithms/frameworks and high I/O throughput are key to meeting the scalability and performance requirements of HEP offline data analysis. Although Hadoop has attracted much attention from the scientific community for its scalability and its parallel computing framework for large data sets, it is still difficult to run HEP data processing tasks directly on Hadoop. In this paper we investigate how to adapt Hadoop so that HEP jobs can run on it transparently. In particular, we discuss a new mechanism that provides HEP software with random access to data in HDFS. HDFS is a streaming data store that supports only sequential writes and appends, so it cannot satisfy the random-access requirements of HEP jobs. The new feature allows Map/Reduce tasks to perform random reads and writes on the local file system of data nodes instead of using the Hadoop data streaming interface, which makes it possible to run HEP jobs on Hadoop. We also develop MapReduce models for diverse HEP jobs such as Corsika simulation, ARGO detector simulation, and Medea++ reconstruction, and we provide a toolkit for users to submit, query, and remove jobs. In addition, we provide cluster monitoring and an accounting system to improve system availability. This work has been in production for HEP experiments since September 2016, delivering about 40,000 CPU hours per month.
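
To make the random-access idea concrete, the following is a minimal sketch in Java under stated assumptions: the abstract gives no API details, so the way a task obtains a datanode-local path is hypothetical here, and the sketch is not the authors' implementation. It only illustrates that once the input data resides on the local file system of the data node, ordinary random reads and in-place writes become possible, which the HDFS streaming interface does not allow.

// Illustrative sketch only: the path handling is hypothetical, since the
// abstract does not describe the actual interface. The point is that a task
// operates on a file on the data node's local file system, where random
// reads and in-place writes work, instead of going through the sequential
// HDFS streaming client.
import java.io.IOException;
import java.io.RandomAccessFile;

public class LocalRandomAccessSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical: a datanode-local path onto which the framework has
        // mapped this task's input data.
        String localReplicaPath = args[0];

        try (RandomAccessFile raf = new RandomAccessFile(localReplicaPath, "rw")) {
            // Random read: seek to an arbitrary offset, as HEP I/O requires
            // (assumes the file is large enough for this example offset).
            raf.seek(4096);
            byte[] header = new byte[256];
            raf.readFully(header);

            // Random write: update a region in place, which the standard HDFS
            // client cannot do (it supports only sequential write and append).
            raf.seek(1024);
            raf.write(new byte[] {0x01, 0x02, 0x03});
        }
    }
}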
