Production Application Performance Data Streaming for System Monitoring

Ramin Izadpanah, Damian Dechev, Benjamin A. Allan, Jim Brandt

doi:10.1145/3319498

Abstract

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as the Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide, low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture.

In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low-overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications.

We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. We leverage LDMS to collect system-level performance data and explore the correlation between the system and application events. We also demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: one system equipped with Intel Xeon Phi and another equipped with an Intel Xeon processor. Our overhead study shows that our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.
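The profiler and sampler split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the inter-process communication mechanism, so shared memory is assumed here, and the counter names and the `Profiler` and `Sampler` classes are hypothetical. No LDMS API is used; printing a sample stands in for streaming it to a collector.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical counter names; a real profiler would define its own set.
COUNTERS = ("mpi_send_calls", "mpi_recv_calls", "mpi_bytes_sent")
SLOT = struct.Struct("<Q")                      # one unsigned 64-bit slot per counter
LAYOUT = struct.Struct("<%dQ" % len(COUNTERS))  # the whole counter block


class Profiler:
    """Application side: owns the segment and increments event counters."""

    def __init__(self):
        self.shm = shared_memory.SharedMemory(create=True, size=LAYOUT.size)
        LAYOUT.pack_into(self.shm.buf, 0, *([0] * len(COUNTERS)))

    def count(self, name, delta=1):
        # Read-modify-write of one slot; assumes a single writer (assumption).
        offset = COUNTERS.index(name) * SLOT.size
        (current,) = SLOT.unpack_from(self.shm.buf, offset)
        SLOT.pack_into(self.shm.buf, offset, current + delta)

    def close(self):
        self.shm.close()
        self.shm.unlink()  # the owner removes the segment


class Sampler:
    """Monitoring side: attaches by name and reads a snapshot on demand."""

    def __init__(self, segment_name):
        self.shm = shared_memory.SharedMemory(name=segment_name)

    def sample(self):
        return dict(zip(COUNTERS, LAYOUT.unpack_from(self.shm.buf, 0)))

    def close(self):
        self.shm.close()


if __name__ == "__main__":
    profiler = Profiler()
    sampler = Sampler(profiler.shm.name)
    profiler.count("mpi_send_calls")
    profiler.count("mpi_bytes_sent", 4096)
    print(sampler.sample())  # stand-in for pushing the sample to a collector
    sampler.close()
    profiler.close()
```

In a deployment the sampler would run in a separate process and call `sample()` on a timer. Because the application only updates counters in memory, it does no file I/O and holds no growing trace, which is the property the abstract emphasizes.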

Journal: ACM Transactions on Modeling and Performance Evaluation of Computing Systems
Publication Date: Apr 23, 2019
Citations: 2