Abstract

Predicting the resource usage of workloads in large-scale production clusters is important both for understanding application characteristics and for helping cluster operators manage resources efficiently. Traditional statistical prediction methods struggle in large-scale, dynamic, and complex clusters. Commonly used deep learning methods such as Recurrent Neural Networks (RNNs) usually rely on the historical data of a single node to predict future resource usage. Because most modern applications (e.g., microservices) are deployed across many nodes in the cluster, these single-node methods cannot predict resource usage well. To solve this problem, we propose a new deep learning model called GraphGRU, which is based on graph attention networks (GAT), to predict resource usage from the cluster perspective. We use the Dynamic Time Warping (DTW) algorithm to construct a graph structure over multiple physical nodes in the cluster, and we use a method similar to data compensation to co-train the model with both horizontal and vertical data. We validate our model on the Alibaba microservices dataset, which was captured from a large-scale production cluster. Compared to traditional deep learning methods, our model improves prediction accuracy by up to 48.27%.
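
To make the graph-construction step concrete, the following is a minimal sketch of how DTW distances between per-node usage series could be turned into graph edges. It is not the paper's implementation: the function names `dtw_distance` and `build_graph`, the absolute-difference cost, and the k-nearest-neighbour edge rule are all assumptions chosen for illustration, since the abstract does not specify how distances are converted into a graph.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW distance between two 1-D series (absolute-difference cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                                 cost[i - 1][j - 1])  # both series advance
    return cost[n][m]


def build_graph(series_by_node, k=1):
    """Hypothetical edge rule: link each node to its k DTW-nearest nodes."""
    nodes = list(series_by_node)
    edges = set()
    for u in nodes:
        ranked = sorted(
            (v for v in nodes if v != u),
            key=lambda v: dtw_distance(series_by_node[u], series_by_node[v]),
        )
        for v in ranked[:k]:
            edges.add(tuple(sorted((u, v))))  # store undirected edges once
    return edges


# Made-up CPU-usage traces for three physical nodes (illustrative only).
usage = {
    "n1": [0.1, 0.2, 0.9, 0.3],
    "n2": [0.1, 0.1, 0.2, 0.9, 0.3],  # same shape as n1, shifted in time
    "n3": [0.8, 0.8, 0.8, 0.8],
}
edges = build_graph(usage, k=1)
```

Because DTW tolerates time shifts, `n1` and `n2` end up at distance zero and are linked, which is the property that makes DTW a plausible similarity measure for nodes running replicas of the same service. The resulting edge set could then serve as the adjacency structure for a graph attention layer.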
