Parallel Query Service for Object-centric Data Management Systems

Houjun Tang, Quincey Koziol, Suren Byna, Bin Dong

doi:10.1109/ipdpsw50202.2020.00076

Abstract

While large-scale scientific experiments and simulations produce massive amounts of data, only a small fraction of that data contains useful information. Efficiently querying such volumes of data to extract that information increases the productivity of the scientific discovery process. Although querying has been explored extensively in relational databases, research on and adoption of querying tools for scientific data stored in parallel file systems on high performance computing (HPC) systems are still in their infancy. In this paper, we introduce a parallel query service, called PDC-Query, for object-centric data management systems (ODMS) on HPC systems. It operates on partitioned objects in parallel and provides several optimization strategies for fast query evaluation. The ODMS paradigm is promising for reducing the data management burden on users and for moving data transparently across the deep memory hierarchy of modern HPC systems. We propose a "global histogram" based approach that accelerates query evaluation through selectivity estimation and by reducing the amount of data that must be loaded from storage and processed. We compare query performance and demonstrate the efficiency and scalability of the different approaches PDC-Query supports, including global histograms, bitmap indexes, sorting, and full scan, on various queries over a plasma physics dataset with 125 billion particles and an astronomy dataset with 25 million objects.

Full Text