Abstract Although high-performance computing (HPC) systems have been scaled to meet the exponentially growing demand for scientific computing, HPC performance variability remains a major challenge in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management. In this article, we propose a new framework to predict performance distributions. The proposed framework is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We predict the HPC I/O distribution using the proposed method for the IOzone variability data. Data analysis results show that our framework can generate accurate predictions, and outperform existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our prediction results can further be used for HPC system variability monitoring and optimization. This article has online supplementary materials.
Read full abstract