Abstract

A parallel application will terminate when a computational node fails. As the number of components in supercomputers increases and applications scale to use these systems, the mean time to failure decreases. Traditional fault tolerance approaches, such as checkpointing, are failing to scale. An alternative approach, which we explore in this paper, is to use VM-based live migration to move a process from a failing node to a healthy one, reducing the fault rate experienced by an application. We investigate the use of a virtualisation environment based on OpenVZ to perform live migrations of virtual machines on which multi-processor parallel applications are running. We explore the correctness, performance, security, and reliability of this approach, along with the overhead of using OS-level virtualised systems for fault recovery. Our results confirm that it is possible to efficiently migrate virtual containers without affecting the correctness or completion of parallel applications.
