Abstract

Monitoring the state of an HPC cluster in a timely and accurate fashion is critical to most system administration functions. For many Cray users, the first step in monitoring is ingestion of log files. Unfortunately, log parsing is an inherently inefficient process, requiring multiple software components to read from and write to files on disk. Cray's own utilities use a message bus, the Event Router Daemon (ERD), for a wide variety of purposes. At the Argonne Leadership Computing Facility (ALCF), we have begun to use this message bus for monitoring via a client library written in Go, allowing us to read structured data directly from Cray's services and, in many instances, bypass log files entirely. In this paper, we examine the implementation and use of this approach on Theta, our 4392-node XC40, as well as the overall benefits and drawbacks of using the ERD for real-time monitoring.
