Abstract

Datacenters are characterized by large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with non-zero failure rates mean that datacenters are subject to significant numbers of failures that can impact performance. Moreover, failures are not always obvious; network components can fail partially, dropping or delaying only subsets of packets. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors. We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate end-host transport-layer flow metrics with per-flow network paths and apply statistical analysis techniques to identify outliers and localize faulty links and/or switches. We evaluate our approach in a production Facebook front-end datacenter, focusing on its effectiveness across a range of traffic patterns.
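The correlation step described above can be sketched as follows. This is a minimal illustration, not the method evaluated in the paper: the abstract does not name the flow metric or the statistical test, so the sketch assumes retransmission counts as the per-flow metric and swaps in a modified z-score (median absolute deviation) as the outlier test. The function names, the link labels, and `EXAMPLE_FLOWS` are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical flow records: (path as a list of link ids, packets sent, retransmits).
# In this made-up topology the link "c-d1" is the one dropping packets.
EXAMPLE_FLOWS = [
    (["s1-a", "a-d1"], 1000, 2),
    (["s1-b", "b-d2"], 1000, 3),
    (["s1-c", "c-d1"], 1000, 40),
    (["s1-c", "c-d2"], 1000, 2),
    (["s2-a", "a-d2"], 1000, 1),
    (["s2-b", "b-d1"], 1000, 2),
    (["s2-c", "c-d1"], 1000, 38),
    (["s2-c", "c-d2"], 1000, 3),
]


def link_retransmit_rates(flows):
    """Attribute each flow's metric to every link on its path, then
    return the aggregate retransmission rate per link."""
    sent = defaultdict(int)
    retx = defaultdict(int)
    for path, packets, retransmits in flows:
        for link in path:
            sent[link] += packets
            retx[link] += retransmits
    return {link: retx[link] / sent[link] for link in sent if sent[link] > 0}


def rank_suspect_links(rates, threshold=3.5):
    """Return (link, score) pairs whose modified z-score exceeds the
    threshold, highest score first. Assumed test, not the paper's."""
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the links share one rate; anything above it stands out.
        return sorted(
            ((link, float("inf")) for link, v in rates.items() if v > med)
        )
    scores = {link: 0.6745 * (v - med) / mad for link, v in rates.items()}
    suspects = [(link, s) for link, s in scores.items() if s > threshold]
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```

Calling `rank_suspect_links(link_retransmit_rates(EXAMPLE_FLOWS))` ranks `c-d1` first. The links `s1-c` and `s2-c` are also flagged, with lower scores, because they share flows with the faulty link; this is why per-flow path information matters, since only the link common to all of the poorly performing flows accumulates the full signal.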
