Abstract

Motivated by the growing complexity and heterogeneity of modern data centers, and by the prevalence of commodity component failures, this article studies the failure-aware placement problem: placing the tasks of a parallel job on machines in a data center with the goal of increasing availability. We consider two models of failures: adversarial and probabilistic. In the adversarial model, each node has a weight (a higher weight implies higher reliability), and the adversary can remove any subset of nodes of total weight at most a given bound W; our goal is to find a placement that incurs the least disruption against such an adversary. In the probabilistic model, each node has a probability of failure, and we need to find a placement that maximizes the probability that at least K out of N tasks survive at any time. For adversarial failures, we first show that (i) the problems are in Σ₂, the second level of the polynomial hierarchy; (ii) a variant of the problem that we call RobustFAP (for Robust Failure-Aware Placement) is co-NP-hard; and (iii) an all-or-nothing version of RobustFAP is Σ₂-complete. We then give a polynomial-time approximation scheme (PTAS) for RobustFAP, a key ingredient of which is a solution that we design for a fractional version of RobustFAP. We then study HierRobustFAP, the fractional RobustFAP problem over a hierarchical network, in which failures can occur at any subset of nodes in the hierarchy and a failure at a node can adversely impact all of its descendants. To solve HierRobustFAP, we introduce a notion of hierarchical max-min fairness and a novel Generalized Spreading algorithm, which is simultaneously optimal for every upper bound W on the total weight of nodes that an adversary can fail. These generalize the classical notion of max-min fairness to work with nodes of differing capacities, differing reliability weights, and hierarchical structures. Using randomized rounding, we extend this algorithm to integral HierRobustFAP. For the probabilistic version, we first give an algorithm for the single-level version, called ProbFAP, that achieves an additive ϵ approximation in the failure probability while giving up a (1 + ϵ) multiplicative factor in the number of failures. We then extend the result to the hierarchical version, HierProbFAP, achieving an additive ϵ approximation in the failure probability while giving up an (L + ϵ) multiplicative factor in the number of failures, where L is the number of levels in the hierarchy.
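
To make the two objectives concrete, the following is a minimal sketch of how a given single-level placement could be scored under each failure model. It is not the article's algorithm: it only evaluates a placement, it uses exponential-time brute force for the adversary, and it assumes that node failures in the probabilistic model are independent and that "disruption" means the number of tasks lost. The function names, the dictionary representation of a placement, and the sample data are all hypothetical.

```python
from itertools import combinations


def worst_case_loss(placement, weights, budget):
    """Largest number of tasks an adversary can remove by failing a set of
    nodes whose total weight is at most `budget` (brute force, small inputs).

    placement: dict node -> number of tasks placed on that node
    weights:   dict node -> reliability weight of that node
    """
    nodes = list(placement)
    worst = 0
    for size in range(len(nodes) + 1):
        for failed in combinations(nodes, size):
            if sum(weights[n] for n in failed) <= budget:
                worst = max(worst, sum(placement[n] for n in failed))
    return worst


def survival_probability(placement, fail_prob, k):
    """Probability that at least k tasks survive, assuming (hypothetically)
    that nodes fail independently with the given probabilities."""
    dist = [1.0]  # dist[s] = probability that exactly s tasks survive so far
    for node, tasks in placement.items():
        p = fail_prob[node]
        new = [0.0] * (len(dist) + tasks)
        for s, q in enumerate(dist):
            new[s] += q * p                # node fails: its tasks are lost
            new[s + tasks] += q * (1 - p)  # node survives: its tasks count
        dist = new
    return sum(dist[k:])


# Hypothetical instance: N = 6 tasks, three nodes.
weights = {"a": 1, "b": 1, "c": 2}
fail_prob = {"a": 0.1, "b": 0.1, "c": 0.1}
spread = {"a": 2, "b": 2, "c": 2}
packed = {"a": 0, "b": 0, "c": 6}

for name, placement in [("spread", spread), ("packed", packed)]:
    loss = worst_case_loss(placement, weights, budget=2)
    prob = survival_probability(placement, fail_prob, k=4)
    print(f"{name}: worst-case loss = {loss}, P(>= 4 survive) = {prob:.3f}")
```

On this toy instance, spreading tasks across nodes scores better than packing them onto one node under both models, which is the intuition behind the spreading and max-min fairness notions mentioned in the abstract.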

