Abstract

Many organizations are moving their systems to the cloud, where providers consolidate multiple clients using virtualization, which creates challenges for business-critical applications. Research has shown that hypervisors fail, often causing common-mode failures that may abruptly disrupt dozens of virtual machines simultaneously. We hypothesize and empirically show that a significant percentage of the virtual machines affected by a hypervisor failure are capable of continuing execution on a new hypervisor. Building on this observation, we design a technique for recovering from hypervisor failures through efficient virtual machine migration to a co-located hypervisor. The technique allows virtual machines to continue executing with minimal downtime and can be applied transparently to existing applications. We evaluate a proof-of-concept implementation by injecting hardware and software faults and show that it can recover, on average, 41-46% of all virtual machines, with a mean virtual machine downtime of 3 seconds.

Highlights

  • Cloud computing infrastructures provide elastic resources to organizations, enabling them to deploy scalable online applications and services while reducing the fixed costs of IT infrastructures [1]

  • Fault injection experiments presented in this paper show that our hypothesis holds and suggest that virtual machines (VMs) can be recovered after hypervisor failures

  • The experiments measure recovery effectiveness, migration time, downtime and runtime overhead


Summary

Introduction

Cloud computing infrastructures provide elastic resources to organizations, enabling them to deploy scalable online applications and services while reducing the fixed costs of IT infrastructures [1]. Virtualization is one of the enabling technologies supporting cloud computing initiatives. Cloud providers rent their physical infrastructure to multiple tenants, using virtualization to execute up to hundreds of virtual machines (VMs) on a single, powerful physical machine [4]. Although this is a very cost-effective approach, it creates the risk of common-mode failures [5], which have been observed in

