ParSA: High-throughput scientific data analysis framework with distributed file system

Tao Zhang,Weimin Zheng,Xiangzheng Sun,Huang Huang,Jiwu Shu,Wei Xue,Nan Qiao

doi:10.1016/j.future.2014.10.015

Abstract

Scientific data analysis and visualization have become the key component for nowadays large scale simulations. Due to the rapidly increasing data volume and awkward I/O pattern among high structured files, known serial methods/tools cannot scale well and usually lead to poor performance over traditional architectures. In this paper, we propose a new framework: ParSA (parallel scientific data analysis) for high-throughput and scalable scientific analysis, with distributed file system. ParSA presents the optimization strategies for grouping and splitting logical units to utilize distributed I/O property of distributed file system, scheduling the distribution of block replicas to reduce network reading, as well as to maximize overlapping the data reading, processing, and transferring during computation. Besides, ParSA provides the similar interfaces as the NetCDF Operator (NCO), which is used in most of climate data diagnostic packages, making it easy to use this framework. We utilize ParSA to accelerate well-known analysis methods for climate models on Hadoop Distributed File System (HDFS). Experimental results demonstrate the high efficiency and scalability of ParSA, getting the maximum 1.3 GB/s throughput on a six nodes Hadoop cluster with five disks per node. Yet, it can only get 392 MB/s throughput on a RAID-6 storage node.

Full Text