Abstract

The state-of-the-art storage architecture of high-performance computing systems was designed decades ago, and at today's scale and level of concurrency it shows significant limitations. Our recent work proposed a new architecture to address the I/O bottleneck of this conventional design, and the system prototype (FusionFS) demonstrated its effectiveness on up to 16 K nodes, a scale on par with today's largest supercomputers. The main objective of this paper is to investigate FusionFS's scalability toward exascale. Exascale computers are predicted to emerge by 2018, comprising millions of cores and billions of threads. We built an event-driven simulator (FusionSim) based on the FusionFS architecture and validated it against FusionFS traces; FusionSim introduced less than 4 percent error between its simulation results and the FusionFS traces. With FusionSim we simulated workloads on up to two million nodes and observed almost linear scalability of I/O performance; these results justify FusionFS's viability for exascale systems. In addition to the simulation work, this paper extends the FusionFS system prototype in the following respects: (1) fault tolerance of file metadata is supported, (2) the limitations of the current system design are discussed, and (3) a more thorough performance evaluation is conducted, covering N-to-1 metadata writes, system efficiency, and additional platforms such as Amazon Cloud.
