Energy Efficient Fault Tolerance for High Performance Computing (HPC) in the Cloud

Ifeanyi P Egwutuoha, David Levy, Rafael Calvo, Bran Selic, Shiping Chen

doi:10.1109/cloud.2013.69

Abstract

With cloud computing, a large number of Virtual Machines (VMs) can be provisioned to form a high performance computing (HPC) system that runs computation-intensive applications under the Hardware as a Service (HaaS) model. Fault Tolerance (FT) for HPC in the cloud is an increasingly challenging issue, because any fault during execution forces the application to be rerun, which costs time, money, and energy. The energy consumption of HPC systems in the cloud has increased significantly as a result of rerunning applications and of fault tolerance mechanisms (e.g., redundant computing). In this paper we present an energy-efficient fault tolerance approach for HPC in the cloud. We develop a generic FT algorithm for HPC systems in the cloud. Our algorithm uses a proactive process-level migration approach; however, it does not rely on a spare node or redundant computing prior to the prediction of a failure. Our experimental results, obtained from a real cloud execution environment, show that the energy utilization of HPC in the cloud while providing fault tolerance can be reduced by as much as 30%.

Full Text