Abstract

With the rapid growth in the number of distributed applications, a new dynamic server environment has emerged in which servers are grouped into clusters whose utilization depends on the current demand for the application. To provide reliable and smooth services, it is crucial to detect and fix possible erratic behavior of individual servers in these clusters. Standard techniques deliver suboptimal results for this purpose. We have developed a method based on machine learning techniques that detects outliers indicating a possibly problematic situation. The method inspects the performance of the rest of the cluster and provides system operators with additional information that allows them to quickly identify the failing nodes. We applied this method to develop a Spark application using the CERN MONIT architecture, and with this application we analyzed monitoring data from multiple clusters of dedicated servers in the CERN data center. In this contribution, we present the results achieved with this new method and with the Spark application for analytics of CERN monitoring data.

Highlights

  • In recent years the challenge of handling big volumes of data has triggered an ever growing production of distributed applications

  • If a suspected anomaly is found in one metric, the administrator needs to compare it with the other metrics in the hope of discovering the nature of the problem

  • Incorporating these metrics in the monitoring systems might be too time-consuming, considering that the lack of skilled administrators often leads to understaffed teams

Summary

Introduction

In recent years, the challenge of handling large volumes of data has triggered an ever-growing production of distributed applications. In particular, noticing errors that lead to performance degradation and potential failures can be difficult, let alone diagnosing problems and tracing them to a specific node or set of nodes. When done manually, these procedures require experts to look through stacks of charts, usually depicting multiple metrics per server. In an attempt to simplify administrators' work, many applications offer a set of internal metrics describing their performance. Incorporating these metrics into the monitoring systems might be too time-consuming, considering that the lack of skilled administrators often leads to understaffed teams. We discuss the efficiency of such an approach and present plans for future improvements.
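
To make the idea of tracing a problem to a specific node more concrete, the following is a minimal sketch of comparing one server against the rest of its cluster for a single metric. It is not the method developed in this work, which is based on machine learning techniques and runs as a Spark application. It swaps in a simple robust statistic (a modified z-score built from the median and the median absolute deviation) purely for illustration. The function name, the threshold of 3.5, the host names, and the metric values are all hypothetical.

```python
from statistics import median


def flag_outlier_nodes(metric_by_node, threshold=3.5):
    """Return the nodes whose metric value lies far from the cluster median.

    Illustrative only: uses a modified z-score, not the paper's method.
    `metric_by_node` maps a (hypothetical) host name to one metric value.
    """
    values = list(metric_by_node.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that a single
    # misbehaving node cannot inflate much.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half of the nodes report the same value, so the
        # score cannot be scaled; this sketch flags nothing in that case.
        return []
    flagged = []
    for node, value in metric_by_node.items():
        # 0.6745 rescales the MAD so the score is comparable to a
        # standard z-score for normally distributed data.
        score = 0.6745 * (value - center) / mad
        if abs(score) > threshold:
            flagged.append(node)
    return flagged


# Hypothetical CPU load readings for one cluster at one point in time.
cpu_load = {
    "node-a": 0.42,
    "node-b": 0.45,
    "node-c": 0.40,
    "node-d": 0.44,
    "node-e": 0.97,
}
print(flag_outlier_nodes(cpu_load))
```

With these made-up readings, only the node whose load is far above its peers is reported. A check of this kind looks at one metric at one instant, so it does not cover the case where several metrics per server have to be compared with each other.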

Distributed Applications at CERN
Preparation of the Input Data for Algorithms
Analyzing the Data
Conclusion and Future Work
