Abstract
The challenge of monitoring a computational center grows as the center deploys larger and more diverse systems. As system size grows, it becomes harder to discern the problem from the noise. Staff often experience alert fatigue, a condition in which so many alerts arrive that the actual problem is obscured by false alarms or by alarms for issues that are merely symptoms of the core problem. The National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory (LBNL) has begun to address this issue by ensuring that most alerts are actionable and that common problems, such as node outages, do not trigger multiple alerts. However, more work is needed for these solutions to extend to emerging extreme-scale systems. In this paper, we propose a framework for proactively monitoring and managing data center operations that can scale to accommodate the heterogeneity and complexity of next-generation systems. We describe a new architecture for the Operations Monitoring and Notification Infrastructure (OMNI) at NERSC that enables proactive monitoring and management at scale by integrating state-of-the-art technology, such as Kubernetes, Prometheus, and Grafana, along with predictive platforms, with data from metrics, sensors, and analytics engines. The system will support the operation of the upcoming Perlmutter HPC system, to be delivered in late 2020, as well as NERSC's subsequent computational system deployments. This comprehensive infrastructure will assist in centrally orchestrating services and deployments, automatically analyzing streaming data, correlating data from multiple sources, and thresholding alerts to identify core issues from a single view.