Cloud application deployment with transient failure recovery

Ioannis Giannakopoulos,Ioannis Konstantinou,Dimitrios Tsoumakos,Nectarios Koziris

doi:10.1186/s13677-018-0112-9

Ioannis Giannakopoulos, Ioannis Konstantinou + Show 2 more

Open Access

https://doi.org/10.1186/s13677-018-0112-9

Copy DOI

Abstract

Application deployment is a crucial operation for modern cloud providers. The ability to dynamically allocate resources and deploy a new application instance based on a user-provided description in a fully automated manner is of great importance for the cloud users as it facilitates the generation of fully reproducible application environments with minimum effort. However, most modern deployment solutions do not consider the error-prone nature of the cloud: Network glitches, bad synchronization between different services and other software or infrastructure related failures with transient characteristics are frequently encountered. Even if these failures may be tolerable during an application’s lifetime, during the deployment phase they can cause severe errors and lead it to failure. In order to tackle this challenge, in this work we propose AURA, an open source system that enables cloud application deployment with transient failure recovery capabilities. AURA formulates the application deployment as a Directed Acyclic Graph. Whenever a transient failure occurs, it traverses the graph, identifies the parts of it that failed and re-executes the respective scripts, based on the fact that when the transient failure disappears the script execution will succeed. Moreover, in order to guarantee that each script execution is idempotent, AURA adopts a lightweight filesystem snapshot mechanism that aims at canceling the side effects of the failed scripts. Our thorough evaluation indicated that AURA is capable of deploying diverse real-world applications to environments exhibiting high error probabilities, introducing a minimal time overhead, proportional to the failure probability of the deployment scripts.

Highlights

The advent of the Cloud computing [1] era has been a key enabler for the migration of many applications from traditional, on-premise servers to public clouds, in order to fully exploit the advantages of the latter: Seemingly infinite resources, billed in a pay-as-you-go manner allow the cloud users to scale their applications, and do so in a cost-effective way
In this work we propose a cloud application deployment methodology that aims at identifying the parts of an application deployment that failed due to a transient failure and repeat them until the deployment is successfully accomplished
We suggest a lightweight filesystem snapshot mechanism in order to ensure that the configuration scripts are idempotent and, can be re-executed as many times as needed

Summary

Introduction

The advent of the Cloud computing [1] era has been a key enabler for the migration of many applications from traditional, on-premise servers to public clouds, in order to fully exploit the advantages of the latter: Seemingly infinite resources, billed in a pay-as-you-go manner allow the cloud users to scale their applications, and do so in a cost-effective way. During the application deployment phase, automation is essential in order to guarantee that different software components such as the cloud’s software stack, the Virtual Machines (VMs), external services, etc., will cooperate in a synchronous manner so as to successfully deploy a given application to the cloud This challenging task has been both an active research field [4,5,6,7,8,9,10,11] and the objective of many production systems, operated by modern cloud providers [12,13,14,15]. In spite of their differences, these systems share the same objective: They expect an application description along with any other runtime configuration option and they are responsible to allocate new resources, provision them and configure a new application instance in a fully automated manner

Methods

Findings

Conclusion