Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems

Jidong Zhai,Liyan Zheng,Jinghan Sun,Xuehai Qian,Wei Xue,Xiongchao Tang,Feng Zhang,Bingsheng He,Weimin Zheng,Wenguang Chen

doi:10.1109/tpds.2022.3158742

Abstract

Variations in the performance of parallel and distributed systems are becoming increasingly challenging. The runtimes of different executions can vary greatly even with a fixed number of computing nodes. Many HPC applications on supercomputers exhibit such variance. This not only leads to unpredictable execution times, but also renders the system’s behavior unintuitive. The efficient online detection of variations in performance is an open problem in HPC research. To solve it, we propose an approach, called <sc xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">vSensor</small> , to detect variations in the performance of systems. The key finding of this study is that the source code of programs can better represent performance at runtime than an external detector. Specifically, many HPC applications contain code snippets that are fixed workload patterns of execution, e.g., the workload of an invariant quantity and a linearly growing workload. This observation allows us to automatically identify these snippets of workload-related code and use them to detect variations in performance. We evaluate <sc xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">vSensor</small> on the Tianhe-2A system with a large number of parallel applications, and the results indicate that it can efficiently identify variations in system performance. The average overhead of 4,096 processes is less than 6% for fixed-workload v-sensors. We identify a problematic node with slow memory by using <sc xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">vSensor</small> that degrades the performance of the program by 21%. A serious issue with network performance is also detected that slows down the Tianhe-2A system by 3.37 times for an HPC kernel.

Full Text