Cloud workloads today are typically managed in a distributed environment and processed across geographically distributed data centers. Cloud service providers have been distributing data centers globally to reduce operating costs while also improving quality of service by using intelligent workload and resource management strategies. Such large scale and complex orchestration of software workload and hardware resources remains a difficult problem to solve efficiently. Researchers and practitioners have been trying to address this problem by proposing a variety of cloud management techniques. Mathematical optimization techniques have historically been used to address cloud management issues. But these techniques are difficult to scale to geo-distributed problem sizes and have limited applicability in dynamic heterogeneous system environments, forcing cloud service providers to explore intelligent data-driven and Machine Learning (ML) based alternatives. The characterization, prediction, control, and optimization of complex, heterogeneous, and ever-changing distributed cloud resources and workloads employing ML methodologies have received much attention in recent years. In this article, we review the state-of-the-art ML techniques for the cloud data center management problem. We examine the challenges and the issues in current research focused on ML for cloud management and explore strategies for addressing these issues. We also discuss advantages and disadvantages of ML techniques presented in the recent literature and make recommendations for future research directions.
Read full abstract