Abstract
Monitoring is an indispensable tool for the operation of any large installation of grid or cluster computing, be it high energy physics or elsewhere. Usually, monitoring is configured to collect a small amount of data, just enough to enable detection of abnormal conditions. Once detected, the abnormal condition is handled by gathering all information from the affected components. This data is processed by querying it in a manner similar to a database. This contribution shows how the metaphor of a debugger (for software applications) can be transferred to a compute cluster. The concepts of variables, assertions and breakpoints that are used in debugging can be applied to monitoring by defining variables as the quantities recorded by monitoring and breakpoints as invariants formulated via these variables. It is found that embedding fragments of a data extracting and reporting tool such as the UNIX tool awk facilitates concise notations for commonly used variables since tools like awk are designed to process large event streams (in textual representations) with bounded memory. A functional notation similar to both the pipe notation used in the UNIX shell and the point-free style used in functional programming simplify the combination of variables that commonly occur when formulating breakpoints.
Highlights
Large-scale computation is indispensable, in high energy physics, and in many other fields of physics, in most fields of science
Correctness and efficiency are intertwined, put, proper setup leads to correct, but not necessarily efficient behaviour, whereas proper operation should aim at optimising efficiency without harming correctness
To the best of the authors’ knowledge, there is no tool with the precise purpose of debugging a compute cluster
Summary
Large-scale computation is indispensable, in high energy physics, and in many other fields of physics, in most fields of science. Since typical installations consist of many moving parts, significant changes can be required for production use, thereby negating all advantages of simple (initial) deployment These observations about monitoring mean: Once a user or the monitoring system notice degradation in correct and/or efficient operation, the first step includes looking at the data stored by monitoring. A cluster debugger is a tool for interactive introspection of data similar to that collected by monitoring and for interaction with nodes based on the outcome of that introspection It does not necessarily introduce new capabilities relative to monitoring, instead, it caters to improved interactivity: The path between proposing a theory explaining the unwanted behaviour and finding evidence in favour or against the theory should consist of a few keystrokes at most
Published Version (
Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have