Markov Chains for Fault-Tolerance Modeling of Stochastic Networks

Adam Meyers,Hui Yang

doi:10.1109/tase.2021.3093035

Abstract

Most real-world networks are time-varying, and many are subject to the stochastic functioning of their nodes and edges. Examples can be seen in the human brain undergoing an epileptic seizure, spontaneous infection and recovery in epidemics, and intermittent functioning of devices in the Internet of Things. Moreover, such networks are becoming increasingly large due to rapid technological advances. However, little has been done to study time-varying, large-scale, stochastic networks (SNs) from a reliability engineering perspective. Toward this goal, this article develops a fault-tolerance model for a type of time-varying network in which nodes (and/or edges) stochastically switch between active and inactive states. It considers fault tolerance from a global connectivity point of view, which has applications in many natural and engineered networks. Specifically, this article presents a Markov chain framework that models the dynamic behavior of nodes and allows for the computation of quantitative measures, including availability and time-to-failure metrics. To accommodate large-scale networks and emphasize global connectivity, this framework utilizes percolation theory, which has recently been of interest in the reliability engineering discipline, to characterize network failure. This article makes several contributions: it proposes a Markov chain framework for computing fault-tolerance metrics that is tractable for large-scale networks, it shows the existence of a phase transition in network availability of a time-varying SN, and it accounts for finite-size effects of percolation in the fault-tolerance model. The proposed methodology is applied to Erdös–Rényi random graphs and a real, large-scale power grid. Experimental results provide insights into network design, maintenance, and failure prevention of time-varying SNs. <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">Note to Practitioners</i> —This work develops a fault-tolerance model for time-varying stochastic networks in which nodes (and/or edges) randomly switch between active and inactive states. To address increasingly large-scale networks that are being studied, this article appeals to percolation theory. Fault tolerance is, thus, studied from a global connectivity perspective where the existence of a large connected component containing most of the nodes characterizes the functioning of the network. Specifically, this article presents a continuous-time Markov chain (CTMC) framework that models node dynamics and allows for the computation of fault-tolerance metrics, including network availability and mean time to failure. The proposed framework computes metrics efficiently for large networks and allows for studying their asymptotics. The percolation threshold describing the dissolution of the large connected component is used as the failure criterion in the CTMC. The practitioner should note that several assumptions are made in the proposed CTMC framework: nodes possess identical and time-invariant failure and recovery rates, node failure and recovery times are exponentially distributed, and node dynamics are independent of one another. In addition, fault-tolerance metrics computed for finite networks are estimators of the true metric values. The proposed framework is advantageous for quantifying the fault tolerance of large-scale, time-varying networks where the combinatorial explosion and a changing network topology pose challenges to the use of traditional reliability methods. A case study of a power grid network shows how to apply the proposed methodology to real networks.

Full Text