Abstract

Due to the ever-increasing number of computer nodes in distributed systems, efficient and effective tools have become crucial for their analysis. Although several efficient methods have been proposed to monitor and profile distributed systems, tracing remains the most effective solution for in-depth system analysis. Tracing is the act of collecting a trace, which is a sequence of low-level events generated by the kernel or by userspace. After data collection, the most important step is event analysis: the paradigm and choice of graphs determine the user's ability to detect abnormal behaviors and identify their root cause. Although tracing is a highly effective approach to analyzing complex systems, the scalability of current analysis tools is limited. As a consequence, tracing is often impractical for large distributed systems. This paper identifies the shortcomings of current approaches, most notably in critical path computation and in the transfer of trace files between nodes. It then proposes new solutions to these drawbacks, in particular a distributed algorithm that computes the critical path without aggregating all traces on a single node, and an efficient architecture for tracing distributed systems. These new solutions are made publicly available.

