Abstract

Data are central to data-intensive scientific environments and continue to grow rapidly. This growth in research data requires storage systems that play a pivotal role in data management and analysis for scientific discovery. Redundant Array of Independent Disks (RAID), a well-known storage technology that combines multiple disks into a single large logical volume, has been widely used for data redundancy and performance improvement. However, it requires RAID-capable hardware or software to build a RAID-enabled disk array, and RAID-based storage is difficult to scale up. To mitigate these problems, many distributed file systems have been developed and are actively used in various environments, especially in data-intensive computing facilities where a tremendous amount of data has to be handled. In this study, we investigated and benchmarked several distributed file systems, namely Ceph, GlusterFS, Lustre, and EOS, for data-intensive environments. In our experiments, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect read and write performance depending on the features of the data, which have to be considered in data-intensive computing environments.
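
To make the kind of measurement described above more concrete, the sketch below times sequential write and read throughput on a file system mounted at a client path, such as a FUSE mount of Ceph, GlusterFS, or EOS. It is a minimal illustration only: the mount point /mnt/dfs, the file size, and the block size are assumptions made for this example and are not the benchmark configuration used in the paper.

```python
"""Minimal sketch of a sequential write/read throughput test on a mounted
(e.g., FUSE) file system. Paths and sizes are illustrative assumptions."""
import os
import time

MOUNT_POINT = "/mnt/dfs"                 # hypothetical client mount point
TEST_FILE = os.path.join(MOUNT_POINT, "throughput_test.bin")
BLOCK_SIZE = 4 * 1024 * 1024             # 4 MiB request size
TOTAL_SIZE = 1024 * 1024 * 1024          # 1 GiB of test data


def write_test() -> float:
    """Sequentially write TOTAL_SIZE bytes and return throughput in MiB/s."""
    block = os.urandom(BLOCK_SIZE)
    start = time.monotonic()
    with open(TEST_FILE, "wb") as f:
        for _ in range(TOTAL_SIZE // BLOCK_SIZE):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())             # ensure data reaches the storage backend
    elapsed = time.monotonic() - start
    return (TOTAL_SIZE / (1024 * 1024)) / elapsed


def read_test() -> float:
    """Sequentially read the file back and return throughput in MiB/s."""
    start = time.monotonic()
    with open(TEST_FILE, "rb") as f:
        while f.read(BLOCK_SIZE):
            pass
    elapsed = time.monotonic() - start
    return (TOTAL_SIZE / (1024 * 1024)) / elapsed


if __name__ == "__main__":
    print(f"write: {write_test():.1f} MiB/s")
    print(f"read:  {read_test():.1f} MiB/s")
```

Note that a naive read-back like this can be inflated by the client page cache; real benchmarks typically drop caches or use direct I/O between the write and read phases.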

Highlights

  • As the amount of computing data increases, the importance of data storage is growing. Research from IDC and Seagate predicted that the global data sphere was only a few ZB in 2010 but would increase to 175 ZB by 2025 [1]

  • In the Reliable Array of Independent Nodes (RAIN) layout, benchmark results are shown as graphs for three of the distributed file systems; Lustre is excluded because it does not support the corresponding layout

  • One of the important characteristics to analyze in such a scientific computing environment is the I/O pattern of the data (see the sketch after this list)
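
As a rough illustration of why I/O patterns matter, the sketch below reads the test file from the previous example first sequentially and then in a shuffled block order and compares the observed throughput. The file path and request size are again assumptions for illustration, not the paper's settings.

```python
"""Sketch comparing sequential vs. random read patterns on a mounted file
system. The path and request size are illustrative assumptions."""
import os
import random
import time

TEST_FILE = "/mnt/dfs/throughput_test.bin"   # hypothetical mount point
BLOCK_SIZE = 1024 * 1024                     # 1 MiB request size


def read_pattern(offsets) -> float:
    """Read BLOCK_SIZE bytes at each offset and return throughput in MiB/s."""
    start = time.monotonic()
    with open(TEST_FILE, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK_SIZE)
    elapsed = time.monotonic() - start
    return (len(offsets) * BLOCK_SIZE / (1024 * 1024)) / elapsed


if __name__ == "__main__":
    size = os.path.getsize(TEST_FILE)
    sequential = list(range(0, size - BLOCK_SIZE + 1, BLOCK_SIZE))
    randomized = sequential.copy()
    random.shuffle(randomized)                # same blocks, random order
    print(f"sequential: {read_pattern(sequential):.1f} MiB/s")
    print(f"random:     {read_pattern(randomized):.1f} MiB/s")
```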

Introduction

As the amount of computing data increases, the importance of data storage is growing. Research from IDC and Seagate predicted that the global data sphere was only a few ZB in 2010 but would increase to 175 ZB by 2025 [1]. Due to the tremendous amount of experimental data produced, data storage is one of the key factors in scientific computing. Rebuilding a RAID is likely to affect the stability of the RAID system, which may result in total data loss. To overcome these drawbacks, many distributed file systems have been developed and deployed at the computing facilities of data-intensive research institutes. Some distributed file systems provide geo-replication, allowing data to be replicated geographically across sites. Due to these features, distributed file systems provide more redundancy than RAID storage systems. It is expected that the outcomes of our research can provide valuable insights that help scientists deploy distributed file systems in their data-intensive computing environments, taking the characteristics of their data into account.
