Abstract
With the rapid growth in the number of distributed applications, a new dynamic server environment has emerged in which servers are grouped into clusters whose utilization depends on the current demand for the application. To provide reliable and smooth services, it is crucial to detect and fix possible erratic behavior of individual servers in these clusters. Standard techniques deliver suboptimal results for this purpose. We have developed a method based on machine learning techniques that detects outliers indicating a possibly problematic situation. The method inspects the performance of the rest of the cluster and provides system operators with additional information that allows them to quickly identify the failing nodes. We applied this method in a Spark application built on the CERN MONIT architecture, and with this application we analyzed monitoring data from multiple clusters of dedicated servers in the CERN data center. In this contribution, we present the results achieved with this new method and with the Spark application for analytics of CERN monitoring data.
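The idea of judging one server against the rest of its cluster can be illustrated with a small sketch. The abstract does not specify the algorithm, so the code below is not the authors' method: it swaps in a simple robust statistic (the modified z-score based on the median and the median absolute deviation) as a stand-in. The function name `flag_outlier_nodes`, the node names, the metric values, and the threshold of 3.5 are all hypothetical choices made for illustration.

```python
from statistics import median


def flag_outlier_nodes(samples, threshold=3.5, min_spread=1e-9):
    """Return the nodes whose metric deviates strongly from the cluster.

    samples    -- dict mapping a node name to one metric value (hypothetical input)
    threshold  -- cut-off on the modified z-score (assumed value, not from the source)
    min_spread -- floor for the spread estimate, avoids division by zero
    """
    values = list(samples.values())
    if not values:
        return []

    # The cluster median serves as the "expected" behaviour of a healthy node.
    center = median(values)

    # Median absolute deviation: a spread estimate that a single
    # misbehaving node cannot distort much.
    mad = median(abs(v - center) for v in values)
    scale = max(mad, min_spread)

    flagged = []
    for node, value in samples.items():
        # Modified z-score; 0.6745 rescales the MAD so that it is
        # comparable to a standard deviation for normally distributed data.
        score = 0.6745 * abs(value - center) / scale
        if score > threshold:
            flagged.append(node)
    return sorted(flagged)


# Hypothetical CPU utilisation (percent) for five nodes of one cluster.
cpu_load = {"node1": 52, "node2": 48, "node3": 50, "node4": 51, "node5": 97}
print(flag_outlier_nodes(cpu_load))  # ['node5']
```

A median-based score is used here because a mean and standard deviation would themselves be pulled toward the failing node, which makes it harder to flag. A real deployment would evaluate many metrics over time rather than a single snapshot.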
Highlights
In recent years, the challenge of handling big volumes of data has triggered an ever-growing production of distributed applications.
If a suspected anomaly is found in one metric, the administrator needs to compare it with the others and hopefully discover the nature of the problem.
Incorporating these metrics in the monitoring systems might be too time-consuming, considering that the lack of skilled administrators often leads to understaffed teams.
Summary
In recent years, the challenge of handling big volumes of data has triggered an ever-growing production of distributed applications. In particular, noticing errors that lead to performance degradation and potential failures can be difficult, let alone diagnosing problems and tracing them to a specific node or set of nodes. When done manually, these procedures require experts to look through stacks of charts, usually depicting multiple metrics per server. In an attempt to simplify administrators' work, many applications offer a set of internal metrics describing their performance. Incorporating these metrics in the monitoring systems might be too time-consuming, considering that the lack of skilled administrators often leads to understaffed teams. We discuss the efficiency of such an approach and present plans for future improvements.