Abstract

Large-scale computing centers supporting modern scientific experiments store and analyze vast amounts of data. A noticeable number of computing jobs executed within the complex distributed computing environments ends with errors of some kind, and the amount of error log data generated every day complicates manual analysis by human experts. Moreover, traditional methods such as specifying regular expression patterns to automatically group error messages become impractical in a heterogeneous computing environment without a well-defined structure of error messages. ClusterLogs framework for error message clustering was developed to address this challenge. Theframework can discover common patterns in error messages from various sources and group them together. One of the essential results of this process is the clear automated description of the resulting clusters, which will be used for the analysis. In this research, we propose that interpreting error messages as a natural language allows us to use transformer-based deep learning models such as BERT for this task. A model for extracting the relevant part of messages was trained and integrated into ClusterLogs to represent each cluster as a few actionable items, ensuring better interpretation and validation of the results of clustering.

Highlights

  • One of the most important goals of the monitoring system of large-scale distributed computing systems is to detect and analyze faults and errors in the infrastructure

  • It is often useful to investigate the distribution of different types of errors: detect the most frequent error patterns for some period of time, carry out a retrospective analysis of these patterns and discover common characteristics of computing jobs that finished with particular failures

  • It is devoted to the automation of error message analysis tasks, allowing the human experts to investigate different types of errors and discover common characteristics of the jobs that resulted in those errors

Read more

Summary

Introduction

One of the most important goals of the monitoring system of large-scale distributed computing systems is to detect and analyze faults and errors in the infrastructure. Due to the scale of the problem significant resources are needed to solve it, and today the error message analysis is only partially automated, mostly in cases such as the error message format being known in advance. Another important aspect of the problem is being able to analyze textual patterns, and link them to the metadata such as job identificators in workload management systems, or computing site and launch parameters for the particular computing job. This paper describes the improvement of the cluster description as well as optional improvement of the clustering results using the BERT model to extract the significant parts of the messages

Error message clustering overview
Relevant part extraction model
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.