Enhancing reliability and response times via replication in computing clusters

Zhan Qiu, Juan F Perez

doi:10.1109/infocom.2015.7218512

Abstract

Computing clusters are widely deployed for scientific and engineering applications that require intensive computation and massive data operations. Because applications and resources in a cluster are subject to failures, fault-tolerance strategies are commonly adopted, sometimes at the expense of additional delays in job response times or unnecessary increases in resource usage. In this paper, we explore concurrent replication with canceling, a fault-tolerance approach in which jobs and their replicas are processed concurrently, and the successful completion of any copy triggers the removal of the remaining ones. We propose a stochastic model to study how this approach affects the cluster service level objectives (SLOs), particularly the offered response time percentiles. In addition to the expected gains in reliability, the proposed model allows us to determine the utilization regions in which introducing replication with canceling effectively reduces response times. Moreover, we show how this model can support resource provisioning decisions with reliability and response time guarantees.
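
The mechanism described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the stochastic model proposed in the paper: it is a toy illustration that assumes exponentially distributed service times and independent per-copy failures, and that ignores queueing entirely, so it cannot reproduce the utilization-dependent effects the paper studies. The names `first_success`, `simulate`, `fail_prob`, and `mean_service` are hypothetical and chosen only for this example.

```python
import random
from typing import List, Optional, Tuple


def first_success(service_times: List[float], failed: List[bool]) -> Optional[float]:
    """Completion time of the earliest copy that succeeds.

    Once that copy finishes, the other copies are canceled, so the job's
    response time is the minimum over the successful copies. Returns None
    if every copy fails.
    """
    ok = [t for t, f in zip(service_times, failed) if not f]
    return min(ok) if ok else None


def simulate(n_jobs: int, fail_prob: float, mean_service: float,
             copies: int, seed: int = 0) -> Tuple[float, List[float]]:
    """Toy simulation (assumption: no queueing, i.i.d. copies).

    Returns the fraction of jobs that completed and the list of their
    response times.
    """
    rng = random.Random(seed)
    completed = []
    for _ in range(n_jobs):
        times = [rng.expovariate(1.0 / mean_service) for _ in range(copies)]
        failed = [rng.random() < fail_prob for _ in range(copies)]
        t = first_success(times, failed)
        if t is not None:
            completed.append(t)
    return len(completed) / n_jobs, completed


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]
```

Comparing `simulate(..., copies=1)` against `simulate(..., copies=2)` for the same `fail_prob` shows the reliability gain from replication in this simplified setting. Whether response time percentiles also improve in a loaded cluster depends on the extra work the replicas add, which is the question the paper's model addresses and which this sketch does not capture.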
