Abstract
Identifying the root cause of performance degradation in distributed systems is often difficult and time‐consuming, yet it is crucial for maintaining high performance. In this paper, we present an execution trace‐driven solution that reduces the effort required to investigate, debug, and solve performance problems in multinode distributed systems. The proposed approach uses a unified analysis method to represent trace data collected from the user‐space level down to the hardware level of the involved nodes, enabling efficient and effective root cause analysis. The solution extracts performance metrics and state information from trace data collected at the user‐space, kernel, and network levels. The multisource trace data is then synchronized and structured in a multidimensional data store designed specifically for this kind of data. A posteriori analysis, following a top‐down approach, is then used to investigate performance problems and detect their root causes. We apply this generic framework to trace data collected from the execution of the web server, database server, and application servers in a distributed LAMP (Linux, Apache, MySQL, and PHP) stack. Using industrial‐level use cases, we show that the proposed approach is capable of investigating the root cause of performance issues, addressing unusual latency, and improving base latency by 70%. This is achieved with minimal tracing overhead that does not significantly impact performance, and with O(log n) query response times for efficient analysis.
Highlights
When performance degradation occurs within a distributed system, it can have multiple causes
We investigated different areas with our proposed tool and identified the root cause of the problem
We have presented a unified analysis method for studying trace data gathered from different layers and sources
Summary
When performance degradation occurs within a distributed system, it can have multiple causes. It may be caused by insufficient system resources, a problem in the network layer, a bug in the software code of a connecting node, incorrect input data, or the misconfiguration of one of the active modules or nodes. Actively monitoring the execution of a distributed system using runtime information can help diagnose such problems [1,2,3]. The runtime execution data, which is usually collected by logging and tracing tools, can help monitor the actual executions of systems, detect possible runtime problems, and hopefully pinpoint their root causes. Tracing is a method that consists of collecting execution logs from a system at runtime [4]. Unlike profiling, which usually provides statistics aggregated over a time range, tracing can show the state of the system at various levels, including the active processes, running system calls, function call stack, network usage, and active elements of disk queues, at specific points in time, e.g., at the moment a latency problem is detected in the system [5].
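To illustrate how trace data supports queries about the state of the system at a given time point, here is a minimal Python sketch. It is not the data store described in the paper: it assumes that state changes are kept as timestamp-ordered lists per attribute, and the class name `StateHistory`, its methods, and the attribute names (`cpu0`, `apache2/syscall`) are hypothetical, chosen only for this example. Each lookup uses a binary search, so a single query costs O(log n) in the number of recorded changes for that attribute.

```python
from bisect import bisect_right
from collections import defaultdict


class StateHistory:
    """Toy state history: per-attribute, timestamp-ordered state changes."""

    def __init__(self):
        self._times = defaultdict(list)   # attribute -> sorted change timestamps
        self._states = defaultdict(list)  # attribute -> state set at each timestamp

    def record(self, attribute, timestamp, state):
        # Assumption: events for one attribute arrive in timestamp order,
        # i.e. the traces have already been synchronized and sorted.
        times = self._times[attribute]
        if times and timestamp < times[-1]:
            raise ValueError("events must be recorded in timestamp order")
        times.append(timestamp)
        self._states[attribute].append(state)

    def state_at(self, attribute, timestamp):
        # Binary search for the last change at or before `timestamp`.
        times = self._times.get(attribute, [])
        index = bisect_right(times, timestamp) - 1
        return self._states[attribute][index] if index >= 0 else None

    def snapshot(self, timestamp):
        # State of every known attribute at one time point.
        return {attr: self.state_at(attr, timestamp) for attr in self._times}


# Hypothetical events: which process runs on a CPU, and which
# system call (if any) a process is currently executing.
history = StateHistory()
history.record("cpu0", 100, "apache2")
history.record("cpu0", 250, "mysqld")
history.record("apache2/syscall", 120, "read")
history.record("apache2/syscall", 180, None)

print(history.snapshot(150))
print(history.snapshot(200))
```

A profiler would typically report only aggregate figures for the interval from 100 to 250, whereas a structure like this one can answer what was running at exactly time 150, which is the kind of question asked when a latency spike is observed.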