Analysis of Job Failure and Prediction Model for Cloud Computing Using Machine Learning

Mohammad S Jassas, Qusay H Mahmoud

doi:10.3390/s22052035

Abstract

Modern applications, such as smart cities, home automation, and eHealth, demand a new approach to improving cloud application dependability and availability. Due to the enormous scope and diversity of the cloud environment, most cloud services, including hardware and software, have encountered failures. In this study, we first analyze and characterize the behaviour of failed and completed jobs using publicly accessible traces. We then design and develop a failure prediction model that identifies jobs likely to fail before the failure occurs. The proposed model aims to improve resource utilization and cloud application efficiency. We evaluate the proposed model on three publicly available traces: the Google cluster, Mustang, and Trinity. In addition, we apply several machine learning models to the traces to find the most accurate one. Our results indicate a significant correlation between unsuccessful tasks and requested resources. The evaluation results also show that our model achieves high precision, recall, and F1-score. Several solutions, such as predicting job failure, developing scheduling algorithms, changing priority policies, or limiting re-submission of tasks, can improve the reliability and availability of cloud services.

Highlights

Fault tolerance for cloud computing can provide uninterrupted cloud services, even if one or more components fail for any reason
In the Google cluster trace, we investigated why the number of failed tasks is very high compared to the number of failed jobs, and found that some tasks are resubmitted thousands of times before they finish successfully
The results show that Decision Trees (DTs) and Random Forest (RF) reach the highest accuracy, precision, recall, and F1-score
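
The four metrics named in the last highlight can be computed from the confusion counts of a binary failure predictor. The following is a minimal sketch, not the authors' evaluation code: the function name `score_failure_predictions`, the `Scores` container, and the convention that `True` means "job failed" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def score_failure_predictions(actual, predicted):
    """Score binary predictions; True is assumed to mean 'job failed'."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    # Confusion counts, treating 'failed' as the positive class.
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        a, p = bool(a), bool(p)
        if a and p:
            tp += 1      # failure correctly predicted
        elif not a and p:
            fp += 1      # completed job flagged as failure
        elif a and not p:
            fn += 1      # failure that was missed
        else:
            tn += 1      # completed job correctly predicted

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Hypothetical labels for eight jobs (not data from the traces).
    actual = [True, True, True, False, False, False, False, False]
    predicted = [True, True, False, True, False, False, False, False]
    print(score_failure_predictions(actual, predicted))
```

Because failed jobs are usually a minority of the jobs in a trace, accuracy alone can look high for a model that rarely predicts failure, which is one reason to report precision, recall, and F1-score alongside it.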

Summary

Introduction

Fault tolerance for cloud computing can provide uninterrupted cloud services, even if one or more components fail for any reason. Modern IoT-Cloud applications, including smart cities and eHealth, require new design architectures that provide high reliability and availability. Both cloud providers and consumers are concerned about availability and reliability, primarily because the cloud architecture is complex, which increases the probability of failure [1–3]. Cloud providers face reliability-related challenges that are strikingly similar to those encountered years ago: power outages, unexpected hardware failures, failed deployments, software failures, and human errors. The reliability and availability of cloud computing remain the main concern of cloud consumers. Amazon Web Services (AWS) has experienced a failure in one of its services, Elastic Block

Objectives

Methods

Discussion

Conclusion