Abstract

High-efficiency data collection remains a major challenge for HPC reliability and resilience, yet it is a prerequisite for overcoming the barriers to fault prediction. Both for future exascale systems and for contemporary supercomputer architectures, substantial effort is required to efficiently collect and analyze the data that carries system fault information within a fault-prediction framework. This article therefore focuses on efficient data collection and preprocessing, presenting an optimized framework that improves the efficiency of data collection on a petascale system. The core of the framework comprises a data collection acceleration layer scheduled by H2FS, finer-grained information obtained through a performance analysis tool, and a new method for log template extraction, which together yield a more efficient and convenient framework for real-time data collection. We conducted extensive tests on a petascale system to validate the solution, and the experimental results demonstrate the effectiveness and scalability of our framework.
