Understanding the latency distribution of cloud object storage systems

Yi Su, Zhan Shi, Dan Feng, Yu Hua

doi:10.1016/j.jpdc.2019.01.008

Abstract

As a fundamental cloud service, the cloud object storage system stores and retrieves millions or even billions of read-heavy data objects. Because it serves a massive number of requests each day, response latency is a vital component of the user experience. Timeouts are also a key issue, as they have a great impact on response latency. Because the distribution of response latency and the occurrence of timeouts are not well understood, current practice is to overprovision resources to meet a Service Level Agreement (SLA) on response latency. Hence, we first build a performance model for the cloud object storage system that assumes no timeouts occur. Our model predicts the percentage of requests meeting an SLA in the context of complicated disk operations, an event-driven programming model, and requests waiting to be accept()-ed. Second, we propose a method that determines whether our model is applicable by predicting the occurrence of timeouts. We evaluate our model on a production system using a real-world trace. In a variety of scenarios, our model reduces prediction errors by up to 90% compared with baseline models, and its overall average error is 2.63%. Moreover, we can accurately predict the applicability of our model.

Full Text