Abstract

Data generation rates are expected to grow very fast for some database workloads going into LHC Run 2 and beyond; in particular this is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing big data sets in a relational database system can quickly become a very hard technical challenge as the size of the active data set and the number of concurrent users increase. Scale-out database technologies are a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware and with open-source software. In this paper we describe the architecture of, and tests on, database systems based on Hadoop and the Cloudera Impala engine. We discuss the results of our tests, including tests of data loading and of integration with existing data sources, in particular relational databases. We report on query performance tests done with various data sets of interest at CERN, notably data from the accelerator log database.
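
The data-loading and integration tests mentioned above can be pictured with a small, hedged sketch: pulling a table from a relational database into Hadoop through Spark's JDBC data source and storing it as Parquet on HDFS. The connection URL, credentials, table name, key column and target path below are placeholders for illustration, not the setup used in the tests.

```python
# Hedged sketch: load a relational table into HDFS with Spark's JDBC source.
# URL, credentials, table, key column and paths are placeholders; a suitable
# JDBC driver must be available on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-to-hdfs-sketch").getOrCreate()

# Read the source table in parallel, splitting the scan on a numeric key column.
source = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db-host.example.org:1521/service")
          .option("dbtable", "LOGGING_DATA")            # hypothetical table
          .option("user", "reader")
          .option("password", "***")
          .option("partitionColumn", "VARIABLE_ID")     # numeric split key
          .option("lowerBound", "0")
          .option("upperBound", "1000000")
          .option("numPartitions", "16")
          .load())

# Store the result as Parquet on HDFS so Impala or Spark can query it later.
source.write.mode("overwrite").parquet("hdfs:///logging/imported/logging_data")
```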

Highlights

  • CERN, and high energy physics in general, have developed over the years many techniques to store and manage large amounts of data

  • Unlike physics data, which are stored on tape and on disk-based file systems, controls data coming from LHC subsystems are inserted into relational databases

  • Cloudera Impala is an SQL query engine running on top of the Hadoop Distributed File System (HDFS) that has demonstrated very good scalability in our tests. Spark is another engine that can be used on top of Hadoop and that has also shown very good scalability for our use cases (a short sketch of using both engines follows this list)
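
As an illustration of the architecture mentioned in the last highlight, the hedged Python sketch below runs the same single-variable selection once through Impala (via the impyla DB-API client) and once through Spark SQL, both reading the same HDFS-resident table. Host name, port, table and column names are assumptions made for this example, not our test configuration.

```python
# Illustrative only: Impala and Spark querying the same HDFS-resident table.
# Host, port and the schema (logging_data: variable_id, ts, value, day) are
# placeholders assumed for this sketch.
from impala.dbapi import connect          # impyla client for Impala
from pyspark.sql import SparkSession

QUERY = """
    SELECT ts, value
    FROM logging_data
    WHERE variable_id = 4242
      AND day = '2015-06-01'
"""

# 1) Impala: MPP SQL query engine reading the HDFS files directly.
conn = connect(host="impala-host.example.org", port=21050)
cur = conn.cursor()
cur.execute(QUERY)
for ts, value in cur.fetchall():
    print(ts, value)

# 2) Spark SQL: the same table, resolved through the shared Hive metastore.
spark = (SparkSession.builder
         .appName("impala-vs-spark-sketch")
         .enableHiveSupport()
         .getOrCreate())
spark.sql(QUERY).show()
```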


Summary

Introduction

CERN, and high energy physics in general, have developed over the years many techniques to store and manage large amounts of data.

Using Hadoop for accelerating data analytics at CERN

Given the variety of available solutions, a series of tests and investigations has been performed to find the right set of technologies for processing the large time-series datasets of interest for CERN use cases.

Bucketing for efficient data access

Even though the Hadoop platform offers highly scalable throughput for data access and processing, there is a family of queries that executes relatively slowly compared to RDBMS systems: those that access only one or a few variables. When the daily data of the LHC logging service are grouped into 10 buckets (using mod(variable_id, 10)), a significant performance improvement is observed for single-variable selections: only 4 GB instead of 40 GB needs to be read from a daily partition, which reduces the query execution time by a factor of 10 (1 s instead of 10 s in our example).
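
A minimal PySpark sketch of this bucketing scheme is given below. The paths and the column names (variable_id, ts, value) are placeholders, and the real service may implement the buckets differently (for example as Hive/Impala table partitions); the point is only that a single-variable query then touches roughly one tenth of a day's data.

```python
# Minimal sketch of bucketing daily time-series data by mod(variable_id, 10).
# Paths and the schema (variable_id, ts, value) are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()
NUM_BUCKETS = 10

# Load one raw daily partition (hypothetical path).
daily = spark.read.parquet("hdfs:///logging/raw/day=2015-06-01")

# Rewrite the day split into 10 sub-directories keyed by mod(variable_id, 10).
(daily
 .withColumn("bucket", F.col("variable_id") % NUM_BUCKETS)
 .write
 .partitionBy("bucket")
 .mode("overwrite")
 .parquet("hdfs:///logging/bucketed/day=2015-06-01"))

# Single-variable selection: read only the bucket holding the requested
# variable instead of the whole daily partition (~1/10 of the data).
variable_id = 4242
bucket = variable_id % NUM_BUCKETS
subset = (spark.read
          .parquet(f"hdfs:///logging/bucketed/day=2015-06-01/bucket={bucket}")
          .where(F.col("variable_id") == variable_id))
subset.select("ts", "value").show()
```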

Benefits from a columnar store

Conclusions
