Developing a Model for Holistic Workload Analysis of Large Supercomputer Systems

V.V. Voevodin, P.A. Shvets, S.A. Zhumatiy

doi:10.26089/nummet.v22r102

Abstract

Any modern supercomputer has an extremely complex architecture, and using its resources efficiently is often very difficult, even for experienced users. At the same time, high-performance computing is increasingly in demand, so the efficient utilization of supercomputers is an urgent issue. Users therefore need to know everything important about the performance of their jobs running on a supercomputer in order to optimize them, and administrators need to be able to monitor and analyze every aspect of how efficiently such systems function. However, there is currently no complete understanding of which data are best studied (and how they should be analyzed) in order to obtain a full picture of the state of a supercomputer and the processes taking place on it. In this paper, we make our first attempt to answer this question. To do this, we are developing a model that describes all the potential factors that may be important when analyzing the performance of supercomputer applications and of the HPC system as a whole. The paper provides both a detailed description of this model for users and administrators and several interesting real-life examples discovered on the Lomonosov-2 supercomputer using a software implementation based on the proposed model.

Full Text