Abstract

High Performance Computing (HPC) systems are used in a variety of industrial and research sectors to solve complex problems that require powerful computing platforms. For these systems to remain reliable, analysts must be able to debug and analyze their behavior in order to find the root causes of poor performance. Execution traces record the events and interactions among communicating processes, information that is essential for debugging inter-process communication. Traces, however, tend to be very large, which limits their practical use. In previous work, we presented an approach for automatically detecting communication patterns and segmenting large HPC traces into execution phases, with the goal of reducing the effort of trace analysis by letting software analysts focus on smaller parts of interest. In this paper, we propose an approach for detecting and localizing inefficient communication patterns using statistical and trace segmentation methods. In addition, we use the Analytic Hierarchy Process to categorize slow communication patterns by their severity and complexity levels. Using our approach, an analyst can quickly locate slow communication patterns that may be the cause of significant performance problems. We show the effectiveness of our approach by applying it to large traces from three HPC systems.
