Abstract

Scientific instruments and computer simulations, such as satellite feeds, medical informatics, and bioinformatics research, are generating massive amounts of data, requiring technological innovation to reveal the underlying structure and facilitate decision making. However, storage capacity, analytical accuracy, and processing efficiency in scientific research are not keeping pace with this exponential data growth. Multidimensional data structures and format-specific indexing methods complicate parallel I/O and unified processing, and out-of-the-box interoperability between large-scale scientific data and big data technologies is lacking. To address these issues, we present SciAP, a programmable, high-performance platform for large-scale scientific data. SciAP enables domain scientists to natively execute Spark programs and applications for processing and analyzing scientific data in HPC environments, uses a model-driven approach to extract abstract models from heterogeneous scientific data formats, and ultimately provides a unified interface to access raw scientific data. We integrate an automatic partitioning algorithm that determines the data partitioning layout from scientific metadata and connects it with Spark's RDD structure to specify task granularity and guide parallel I/O. Experimental evaluation shows that SciAP achieves an overall improvement of 2.1x over Spark's range partitioning and a 2.3x speedup over a serial implementation of the Prestack Kirchhoff Time Migration algorithm.
