Abstract

Stragglers are exceptionally slow tasks within a job that delay its completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, unbiased inference, interpretable models, and computational efficiency. We demonstrate Hound's capabilities for a production trace from Google's warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call