Abstract

With the explosion of the number of distributed applications, a new dynamic server environment emerged grouping servers into clusters, utilization of which depends on the current demand for the application. To provide reliable and smooth services it is crucial to detect and fix possible erratic behavior of individual servers in these clusters. Use of standard techniques for this purpose requires manual work and delivers sub-optimal results. Using only application agnostic monitoring metrics our machine learning based method analyzes the recent performance of the inspected server as well as the state of the rest of the cluster, thus checking not only the behavior of the single server, but the load on the whole distributed application as well. We have implemented our method in a Spark job running in the CERN MONIT infrastructure. In this contribution we present results of testing multiple machine learning algorithms and pre-processing techniques to identify the servers erratic behavior. We also discuss the challenges of deploying our new method into production.

Highlights

  • In the last few decades the amount of digitally saved data has been growing exponentially [? ]

  • In 2014 the world’s technological capacity to store information has reached almost 5 zettabytes [1]. Handling this incredible amount of incoming data requires innovative techniques increasingly leveraging horizontal scaling; an approach utilizing many computers instead of one more powerful. Such novel approaches create additional concerns for the system administrators, when it comes to noticing errors that pose a threat to the efficiency and availability of the application

  • We present a process of acquiring and processing a stream of raw monitoring data in the MONIT [2] infrastructure

Read more

Summary

Introduction

In the last few decades the amount of digitally saved data has been growing exponentially [? ]. In 2014 the world’s technological capacity to store information has reached almost 5 zettabytes [1] Handling this incredible amount of incoming data requires innovative techniques increasingly leveraging horizontal scaling; an approach utilizing many computers instead of one more powerful. Traditional monitoring methods require lots of manual labor when applied to this problem, developers of open source monitoring system have been so far reluctant to include any advanced tools In this project we set to explore the possibility of using machine learning to spot erratic servers within a cluster running a distributed application. In an attempt to simplify administrators work, many applications offer a set of internal metrics describing their performance Incorporating these metrics in the existing monitoring systems might be too time-consuming, considering that the lack of skilled administrators often leads to understaffed teams. We discuss the efficiency of such an approach, benchmark the core model and present plans for future development

Monitoring Systems Overview
Data Gathering
Creating Anomalies
Analysing the Data
Unsupervised learning
Conclusions and Future Work
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call