Abstract

The increasing demand of the data center's computational capacity in recent years has introduced new data center operational challenges among others to maintain the service level agreements (SLA) and quality of services (QoS), while at the same time limiting energy consumption. In this paper, a stochastic operational risk assessment approach is presented that estimates the required number of spare servers in a data center considering the risk of servers' failure in operation since servers define the computational capability of a data center. A reliability index called “risk of computational resource commitment (RCRC)” is introduced that quantifies the probability of having insufficient spare servers due to failures during the operational lead time, and the complement of the RCRC shows the ability of the resources to maintain SLA of a data center. The failure rates of the servers are obtained using a Monte Carlo Simulation with the failure data, published by Google in 2019. The analysis shows that the RCRC reduces with the increasing number of spare servers, while it also stresses the energy efficiency of the data center. The RCRC index could be used in data center operation to avoid overprovisioning of the servers and to limit the number of spare servers in the data center, while creating a suitable balance between QoS and energy consumption of the data centers.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call