Abstract
Microservice architectures and container technologies are broadly adopted by giant internet companies to support their web services, which typically have a strict service-level objective (SLO), tail latency, rather than average latency. However, diagnosing SLO violations, e.g., long tail latency problem, is non-trivial for large-scale web applications in shared microservice platforms due to million-level operational data and complex operational environments. We identify a new type of tail latency problem for web services, small-window long-tail latency (SWLT), which is typically aggregated during a small statistical window (e.g., 1-minute or 1-second). We observe SWLT usually occurs in a small number of containers in microservice clusters and sharply shifts among different containers at different time points. To diagnose root-causes of SWLT, we propose an unsupervised and low-cost diagnosis algorithm-?-Diagnosis, using two-sample test algorithm and ?-statistics for measuring similarity of time series to identify root-cause metrics from millions of metrics. We implement and deploy a real-time diagnosis system in our real-production microservice platforms. The evaluation using real web application datasets demonstrates that ?-Diagnosis can identify all the actual root-causes at runtime and significantly reduce the candidate problem space, outperforming other time-series distance based root-cause analysis algorithms.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.