Abstract

The analysis of self-stabilizing algorithms is often limited to the worst-case stabilization time starting from an arbitrary state, i.e., a state resulting from a sequence of faults. Considering that these algorithms are intended to provide fault tolerance in the long run, this is not the most relevant metric. A common situation is that a running system is in a legitimate state when hit by a single fault. This event has a much higher probability than multiple concurrent faults. Therefore, the worst-case time to recover from a single fault is more relevant than the recovery time from a large number of faults. This paper presents techniques, based on Markov chains in combination with lumping, to derive upper bounds for the mean time to recover from a single fault for self-stabilizing algorithms. To illustrate their applicability, the techniques are applied to a new self-stabilizing coloring algorithm.

Highlights

  • Fault tolerance aims at making distributed systems more reliable by enabling them to continue the provision of services in the presence of faults

  • In particular, we demonstrate how lumping can be applied to reduce the complexity of the Markov chains

  • The analysis of self-stabilizing algorithms is often confined to the stabilization time starting from an arbitrary configuration

Summary

Introduction

Fault tolerance aims at making distributed systems more reliable by enabling them to continue providing services in the presence of faults. Self-stabilizing algorithms belong to the category of distributed algorithms that provide non-masking fault tolerance. They guarantee that systems eventually recover from transient faults of any scale, such as perturbations of the state in memory or communication message corruption [2]. The containment time of an algorithm A denotes the worst-case number of rounds that any execution of A starting from a 1-faulty configuration needs to reach a legitimate configuration. A single fault is far more likely than multiple concurrent faults, because a distributed system consists of independently operating computers in which transient faults, such as memory faults in different computers, are independent events. Considering this fact, it comes as a surprise that most papers consider only arbitrary initial states (i.e., k-faulty configurations for any k) instead of focusing on 1-faulty configurations. We believe that the techniques can be applied to other algorithms.
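To make the idea of bounding recovery time with Markov chains and lumping more concrete, the following is a minimal sketch on a toy chain. It is not the chain of algorithm Acol and not taken from the paper: the three-state transition matrix `P`, the partition `blocks`, and the helper names `lump` and `mean_absorption_times` are all assumptions chosen for illustration. State 0 plays the role of a legitimate (absorbing) configuration, and states 1 and 2 stand for two symmetric faulty configurations that can be merged into one lumped state.

```python
from fractions import Fraction as F

def lump(P, blocks):
    """Return the lumped transition matrix for the partition `blocks`.

    The partition is lumpable only if all states in a block have the same
    total transition probability into every block; otherwise raise ValueError.
    """
    lumped = []
    for B in blocks:
        rows = [[sum(P[i][j] for j in C) for C in blocks] for i in B]
        if any(r != rows[0] for r in rows):
            raise ValueError("partition is not lumpable")
        lumped.append(rows[0])
    return lumped

def mean_absorption_times(P, absorbing):
    """Expected number of steps until absorption from each transient state.

    Solves t_i = 1 + sum_j P[i][j] * t_j over the transient states
    (t = 0 on absorbing states) by Gauss-Jordan elimination.
    """
    trans = [i for i in range(len(P)) if i not in absorbing]
    n = len(trans)
    # Augmented matrix of the linear system (I - Q) t = 1.
    A = [[(F(1) if r == c else F(0)) - P[trans[r]][trans[c]] for c in range(n)]
         + [F(1)] for r in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        pivot = A[col][col]
        A[col] = [x / pivot for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return {trans[r]: A[r][n] for r in range(n)}

# Hypothetical toy chain: state 0 is legitimate, states 1 and 2 are faulty.
P = [
    [F(1),    F(0),    F(0)],
    [F(1, 2), F(0),    F(1, 2)],
    [F(1, 2), F(1, 2), F(0)],
]
blocks = [[0], [1, 2]]          # merge the two symmetric faulty states

full = mean_absorption_times(P, {0})
lumped_P = lump(P, blocks)
lumped = mean_absorption_times(lumped_P, {0})
```

In this toy case the lumped chain has a single transient state, and its expected absorption time agrees with that of both original faulty states, so the smaller chain can be analyzed in place of the larger one. Exact rational arithmetic is used here only to keep the comparison free of rounding error.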

Related Work
System Model
Contamination Radius
Containment Time
Self-Stabilizing Algorithms and Markov Chains
Algorithm Acol
Fault Containment Time of Algorithm Acol
Message Corruption
Memory Corruption
Conclusions