Improving the Efficiency of Markov Chain Analysis of Complex Distributed Systems

Fern Y. Hunt, Christopher Dabrowski, Katherine Morrison

doi:10.6028/nist.ir.7744

Abstract

In large-scale distributed computing systems, the interactions of many independent components may lead to emergent global system behaviors with unforeseen, often detrimental, outcomes. The increasing economic importance of distributed systems, such as cloud computing systems, grid computing systems, and the Internet, argues for developing analytical tools to understand and predict complex system behavior in order to ensure the availability and reliability of computing services. In previous work, we described one such tool, in which a piecewise homogeneous discrete Markov chain representation of a grid computing system can be systematically perturbed to predict situations that lead to marked performance degradation and system-wide failure. While the run times of the Markov chain model compared favorably with those of testbeds or detailed large-scale simulations, it was still often necessary to execute a sizable number of alternative perturbations of the model to identify scenarios in which system performance is likely to degrade or in which anomalous behaviors may occur. Here, we extend our original approach and describe two novel methods for more quickly identifying portions of the Markov chain that are likely to be sensitive to perturbation. The first method involves finding cut sets, which consist of state transitions that effectively disconnect all paths in a Markov chain from the initial state to a desired end state. We show that by perturbing the state transitions in the cut set, it is possible to more quickly identify scenarios in which system performance is adversely affected. We also show that this new method can be applied to larger Markov models than in our earlier work and therefore provides better scalability. We then present a second method, in which the Spectral Expansion Theorem is used to analyze the eigensystem of a set of Markov transition probability matrices (TPMs) in order to identify eigenvectors and eigenvalues that can be used to predict system performance. We describe how this second approach can also be used to indicate which state transitions, if perturbed, are likely to adversely affect system performance. Results are presented for both methods to show that they can be used to identify the same failure scenarios presented in our earlier paper (as well as additional scenarios, using the first method), while reducing the number of perturbations of the Markov model (or eliminating Markov simulation altogether, using the second method). We believe that these methods provide a basis for creating practical tools for analysis of complex systems, and we discuss future work towards this end.
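As a rough illustration of the cut-set idea described in the abstract, the sketch below finds a set of state transitions whose removal disconnects every path from an initial state to a desired end state. It is not necessarily the algorithm used in the paper: it assumes a standard max-flow/min-cut computation (Edmonds-Karp) with transition probabilities used as edge capacities. The state names, the probabilities, and the function name min_cut_transitions are all hypothetical and chosen only for the example.

```python
from collections import deque


def min_cut_transitions(probs, source, sink):
    """Return (cut_value, cut_edges) for a minimum source-to-sink cut.

    probs maps (from_state, to_state) -> transition probability, which is
    used here as the edge capacity (an illustrative assumption).
    """
    # Residual capacities, with a zero-capacity reverse edge for each edge.
    residual = {source: {}, sink: {}}
    for (u, v), p in probs.items():
        if u == v:
            continue  # self-loops never lie on a simple path to the sink
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0.0) + p
        residual[v].setdefault(u, 0.0)

    def bfs_parents():
        # Breadth-first search over edges with remaining capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0.0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            break  # no augmenting path left
        # Find the bottleneck capacity along the path found by BFS.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            u = parents[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        # Push the bottleneck amount of flow along that path.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

    # States still reachable from the source form one side of the cut;
    # original transitions leaving that side are the cut set.
    reachable = set(parents)
    cut = sorted(
        (u, v) for (u, v) in probs if u in reachable and v not in reachable
    )
    return flow, cut


# Hypothetical six-state chain; each row of probabilities sums to 1.
EXAMPLE_CHAIN = {
    ("Initial", "Discovering"): 1.0,
    ("Discovering", "Discovering"): 0.25,
    ("Discovering", "Negotiating"): 0.6,
    ("Discovering", "Running"): 0.05,
    ("Discovering", "Failed"): 0.1,
    ("Negotiating", "Running"): 0.7,
    ("Negotiating", "Discovering"): 0.2,
    ("Negotiating", "Failed"): 0.1,
    ("Running", "Completed"): 0.9,
    ("Running", "Failed"): 0.1,
    ("Completed", "Completed"): 1.0,
    ("Failed", "Failed"): 1.0,
}

if __name__ == "__main__":
    value, transitions = min_cut_transitions(
        EXAMPLE_CHAIN, "Initial", "Completed"
    )
    print(value, transitions)
```

In this toy chain the cut consists of the two transitions out of Discovering that lead toward Completed. Under the approach the abstract describes, transitions like these would be the candidates to perturb first, rather than sweeping over every transition in the model.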

Full Text