FinOps-driven optimization of cloud resource usage for high-performance computing using machine learning

Piotr Nawrocki, Mateusz Smendowski

doi:10.1016/j.jocs.2024.102292

Abstract

Cloud computing is gaining popularity in high-performance computing (HPC) applications. It enables advanced simulations when local computing resources are limited. However, cloud usage may increase costs and entail risks of resource unavailability. This article presents an original approach that employs machine learning to predict long-term cloud resource usage. Such predictions make it possible to optimize resource utilization through appropriate reservation plans, reducing the associated costs. The solution developed utilizes statistical models, XGBoost, neural networks and the Temporal Fusion Transformer (TFT). Long-term prediction of cloud resource consumption is especially critical for prolonged simulations. The Cloud Resource Usage Optimization System uses the prediction results to dynamically create resource reservation plans across various virtual machine types for HPC on the Google Cloud Platform. Experiments using real-life production data demonstrate that the TFT prediction model improved prediction quality by 31.4% compared to the best baseline method, particularly in adapting to chaotic changes in resource consumption. However, the best prediction model in terms of error magnitude might not be the most suitable for resource reservation planning. This was validated with the neural network-based method by introducing an FR metric for forecast evaluation. Resource reservation plans were assessed both qualitatively and quantitatively, focusing on aspects such as service-level agreement compliance and potential downtime. This paper is an extension of work originally presented at the International Conference on Computational Science (ICCS 2023), entitled “Long-Term Prediction of Cloud Resource Usage in High-Performance Computing”.

Full Text