Abstract

With the explosion of data-intensive web applications and requests (tasks), geo-distributed, large-scale data centers (DCs) are widely deployed in Software as a Service (SaaS) clouds, while server failures continue to grow at the same time. In this context, task-scheduling problems become more intricate, and both scheduling quality and scheduling speed have become pressing concerns. In this paper, we first propose a virtualized and monitored SaaS model with predictive maintenance to minimize the cost of fault tolerance. Then, using the monitored and predicted availability states of servers, we focus on dynamic real-time task scheduling in geo-distributed, large-scale DCs with heterogeneous servers. Multiple objectives, including long-term performance benefits, energy costs, and communication costs, are taken into consideration to improve scheduling quality. For inter-DC and intra-DC task scheduling, two dynamic programming problems are formulated, respectively; however, both the state and action spaces are too large to be solved by simple iteration. To address this issue, we introduce ideas from reinforcement learning into the solution of traditional stochastic dynamic programming problems in the large-scale SaaS cloud and put forward a cascaded two-level (inter-DC and intra-DC) approximate dynamic programming (ADP) task-scheduling algorithm. The computational complexity is significantly reduced and scheduling speed is greatly improved. Finally, we conduct experiments with both random simulation data and Google cloud trace-logs. QoS evaluations and comparisons demonstrate that the two ADP algorithms work cooperatively and that our two-level ADP algorithm is more effective under large volumes of bursty requests.
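To give a flavor of the ADP idea the abstract describes, the sketch below shows a toy intra-DC scheduler that replaces exact value iteration over the full state space with a linear value-function approximation updated online, in the spirit of reinforcement-learning-based ADP. This is an illustrative sketch only; the `Server` class, feature basis, and cost terms are hypothetical and not the authors' implementation.

```python
class Server:
    """Hypothetical server model: capacity and a per-task energy-cost proxy."""
    def __init__(self, capacity, power_per_task):
        self.capacity = capacity              # max concurrent tasks
        self.load = 0                         # current number of tasks
        self.power_per_task = power_per_task  # energy cost of hosting one task

def features(server):
    """Basis functions for the linear value approximation (utilization-based)."""
    util = server.load / server.capacity
    return [1.0, util, util ** 2]

def value(weights, server):
    """Approximate cost-to-go of a server's state."""
    return sum(w * f for w, f in zip(weights, features(server)))

def schedule(task_stream, servers, weights, alpha=0.1, gamma=0.9):
    """Greedy one-step lookahead with TD(0)-style weight updates,
    avoiding iteration over the exponentially large joint state space."""
    for _ in task_stream:
        candidates = [s for s in servers if s.load < s.capacity]
        if not candidates:
            continue  # in a real system the task would be queued or rejected
        # Choose the server minimizing immediate cost plus discounted future value.
        best = min(candidates,
                   key=lambda s: s.power_per_task + gamma * value(weights, s))
        v_old = value(weights, best)
        best.load += 1
        # TD(0) update: move the approximation toward the observed cost-to-go.
        target = best.power_per_task + gamma * value(weights, best)
        td_err = target - v_old
        phi = features(best)
        for i in range(len(weights)):
            weights[i] += alpha * td_err * phi[i]
    return weights
```

A cascaded two-level version would run one such approximation at the inter-DC level (choosing a data center) and another at the intra-DC level (choosing a server), which is the structure the paper's algorithm takes.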
