Three levels of fail-safe mode in MPI I/O NVRAM distributed cache

Artur Malinowski, Paweł Czarnul

doi:10.1016/j.procs.2018.08.237

Abstract

The paper presents the architecture and design of three versions of fail-safe data storage in a distributed cache using NVRAM in cluster nodes. In the first one, cache consistency is assured through additional buffering of write requests. The second one is based on additional write log managers running on different nodes. The third one benefits from synchronization with a Parallel File System (PFS), saving data into a new file, which allows file history to be kept at the cost of space. We show that the three-level fail-safe mode incorporating these versions introduces minimal overhead for a random walk microbenchmark application (1 GB file, checkpoints created every 2000 iterations) and for computing powers of a graph with 10000 vertices, and up to 20% overhead for parallel processing of images of up to 1000 megapixels, compared to the basic NVRAM cache without fail-safe modes. We also present checkpoint creation and restore times for sizes up to 10 GB.

Full Text