TRANSFORMER-BASED MODEL FOR THE SEMANTIC PARSING OF ERROR MESSAGES IN DISTRIBUTED COMPUTING SYSTEMS IN HIGH ENERGY PHYSICS

D Grin, M Grigorieva

doi:10.54546/mlit.2021.19.82.001

Abstract

Large-scale computing centers supporting modern scientific experiments store and analyze vast amounts of data. A noticeable number of computing jobs executed within these complex distributed computing environments end with errors of some kind, and the amount of error log data generated every day complicates manual analysis by human experts. Moreover, traditional methods such as specifying regular expression patterns to automatically group error messages become impractical in a heterogeneous computing environment without a well-defined structure of error messages. The ClusterLogs framework for error message clustering was developed to address this challenge. The framework can discover common patterns in error messages from various sources and group them together. One of the essential results of this process is a clear automated description of the resulting clusters, which is then used for the analysis. In this research, we propose that interpreting error messages as natural language allows us to use transformer-based deep learning models such as BERT for this task. A model for extracting the relevant part of messages was trained and integrated into ClusterLogs to represent each cluster as a few actionable items, ensuring better interpretation and validation of the clustering results.

Highlights

One of the most important goals of the monitoring system of large-scale distributed computing systems is to detect and analyze faults and errors in the infrastructure.
It is often useful to investigate the distribution of different types of errors: detect the most frequent error patterns over some period of time, carry out a retrospective analysis of these patterns, and discover common characteristics of computing jobs that finished with particular failures.
This work is devoted to the automation of error message analysis tasks, allowing human experts to investigate different types of errors and discover common characteristics of the jobs that resulted in those errors.

Summary

Introduction

One of the most important goals of the monitoring system of large-scale distributed computing systems is to detect and analyze faults and errors in the infrastructure. Due to the scale of the problem, significant resources are needed to solve it, and today error message analysis is only partially automated, mostly in cases where the error message format is known in advance. Another important aspect of the problem is the ability to analyze textual patterns and link them to metadata such as job identifiers in workload management systems, or the computing site and launch parameters of a particular computing job. This paper describes the improvement of the cluster description, as well as an optional improvement of the clustering results, using the BERT model to extract the significant parts of the messages.
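To make the idea of linking textual patterns to job metadata concrete, the following is a minimal sketch of the traditional rule-based grouping mentioned in the abstract. It is not the ClusterLogs algorithm or the BERT-based model. The masking rules, the record fields (`job_id`, `site`, `message`), the sample messages, and the function names `to_pattern` and `group_errors` are all assumptions introduced for illustration.

```python
import re
from collections import defaultdict

# Hypothetical masking rules (an assumption, not taken from ClusterLogs):
# variable fragments are replaced with placeholders so that messages
# differing only in those fragments map to the same pattern.
_MASKS = [
    (re.compile(r"(?:/[\w.\-]+)+"), "<PATH>"),    # file system paths
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),  # hexadecimal values
    (re.compile(r"\b\d+\b"), "<NUM>"),             # standalone integers
]


def to_pattern(message: str) -> str:
    """Reduce an error message to a textual pattern by masking variable parts."""
    pattern = message
    for regex, placeholder in _MASKS:
        pattern = regex.sub(placeholder, pattern)
    return " ".join(pattern.split())  # normalize whitespace


def group_errors(records):
    """Group records by pattern and collect the metadata linked to each group."""
    clusters = defaultdict(lambda: {"count": 0, "sites": set(), "job_ids": []})
    for rec in records:
        cluster = clusters[to_pattern(rec["message"])]
        cluster["count"] += 1
        cluster["sites"].add(rec["site"])
        cluster["job_ids"].append(rec["job_id"])
    return dict(clusters)


# Invented sample records, for illustration only.
records = [
    {"job_id": 101, "site": "SITE_A",
     "message": "Failed to open file /data/run42/evt.root after 3 attempts"},
    {"job_id": 102, "site": "SITE_B",
     "message": "Failed to open file /data/run7/hits.root after 5 attempts"},
    {"job_id": 103, "site": "SITE_A",
     "message": "Connection timed out after 60 seconds"},
]

clusters = group_errors(records)
for pattern, info in clusters.items():
    print(info["count"], sorted(info["sites"]), pattern)
```

Hand-written rules of this kind only work when the variable parts of the messages are known in advance. That is the limitation motivating the learned approach described in this paper.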

Error message clustering overview

Relevant part extraction model

Conclusion