Predicting machine behavior from Google cluster workload traces

Adnan Umer, Adnan Noor Mian, Omer Rana

doi:10.1002/cpe.7559

Abstract

Data centers today host a large number of computational resources to support the increasing demand for computation and storage. Understanding how these physical and virtual machines transition between different states of operation (referred to as the machine lifecycle) enables more efficient data center operation management. Furthermore, it helps data center operators define policies on how new computational resources can be added or existing infrastructure decommissioned. Using version 3 of the Google cluster trace data set, collected from approximately 96k machines, we analyze machine failures and changes in the machine lifecycle over time. We observe a 13% chance of another machine failing under the same network switch within 1 min of a previous machine failure. We propose a Markov chain-based model that can predict machine states at any given time. Using the model and estimated probabilities, we predict the machine state over a span of several days with high probability. Using the predicted machine states, we reconstruct the trend in active machines and compare it with the trend reported in the data set, observing an error of 1.76%.
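To make the Markov chain idea concrete, the following is a minimal sketch of how transition probabilities could be estimated from an observed state sequence and then used to project a state distribution forward in time. It is not the authors' model: the state names (`active`, `removed`), the example sequence, the transition matrix values, and the function names are all illustrative assumptions.

```python
# Minimal sketch of a discrete-time Markov chain over machine states.
# ASSUMPTION: the states and all numbers below are hypothetical and are
# not taken from the paper or from the Google cluster trace.
STATES = ["active", "removed"]


def estimate_transitions(sequence, states):
    """Estimate a transition matrix from one observed state sequence.

    Each row is the maximum-likelihood estimate: the count of i -> j
    transitions divided by the count of all transitions leaving i.
    Rows for states that are never left remain all zeros.
    """
    index = {s: k for k, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for current, nxt in zip(sequence, sequence[1:]):
        counts[index[current]][index[nxt]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def step(dist, matrix):
    """Advance a state distribution by one time step (dist times matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def predict(dist, matrix, steps):
    """Return the state distribution after the given number of steps."""
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


# Hypothetical transition matrix: P[i][j] is the probability of moving
# from STATES[i] to STATES[j] in one time step.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]

if __name__ == "__main__":
    observed = ["active", "active", "removed", "active"]
    print(estimate_transitions(observed, STATES))
    # Start from a machine known to be active and look two steps ahead.
    print(predict([1.0, 0.0], P, 2))
```

Summing the predicted probability of the active state across all machines at each time step gives an expected count of active machines, which is one way a predicted trend could be compared against the trend observed in a trace.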

Full Text