Abstract

Most public and private cloud providers have experienced service failures that affected numerous applications and websites. Failure analysis is therefore a critical step in understanding the causes of different failure types and remediating them. It relies on monitoring the most significant system metrics in order to study behavior and frequency changes in the system; the monitored data are stored in log files and later used for analysis and prediction tasks. In this paper, we focus on analyzing and interpreting the characteristic behavior of finished and failed jobs in relation to physically available resources, using a publicly available dataset, the Google cluster trace. The primary objective of our work is to improve the understanding of job failure in cloud computing environments. Our results show a clear correlation between failed jobs and requested resources, including memory, CPU, and disk space. Based on these results, we find that several techniques can be applied to increase the reliability and availability of cloud applications, such as developing scheduling algorithms, predicting job failure, limiting task resubmission, or changing priority policies.
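
As an illustration of the kind of analysis described above, the following is a minimal sketch of how the association between job failure and requested resources could be measured with a Pearson correlation. It is not the method used in the paper. The record fields (`cpu`, `memory`, `disk`, `status`), the function names, and the toy values are all hypothetical and are not taken from the Google cluster trace schema or from the paper's results.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    # Population covariance divided by the product of population std devs.
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def failure_correlations(jobs, resources=("cpu", "memory", "disk")):
    """Correlate each requested resource with a 0/1 failure indicator.

    `jobs` is a list of dicts with hypothetical keys: one numeric entry per
    resource plus a "status" entry that is either "finished" or "failed".
    """
    failed = [1.0 if job["status"] == "failed" else 0.0 for job in jobs]
    return {r: pearson([job[r] for job in jobs], failed) for r in resources}


# Made-up records for illustration only; these are NOT trace data.
toy_jobs = [
    {"cpu": 0.1, "memory": 0.05, "disk": 0.01, "status": "finished"},
    {"cpu": 0.2, "memory": 0.10, "disk": 0.02, "status": "finished"},
    {"cpu": 0.6, "memory": 0.40, "disk": 0.05, "status": "failed"},
    {"cpu": 0.8, "memory": 0.50, "disk": 0.04, "status": "failed"},
]

if __name__ == "__main__":
    for resource, r in failure_correlations(toy_jobs).items():
        print(f"{resource}: r = {r:.3f}")
```

Correlating a numeric request with a binary failure indicator in this way is equivalent to a point-biserial correlation. It only captures linear association, so a real analysis of trace data would likely also need to account for priority, resubmissions, and other confounding factors.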
