Abstract
This paper describes a two-fold approach to using Triple Modular Redundancy (TMR) in a Wireless Adhoc Network (AdhocNet). A distributed checkpointing and recovery protocol is proposed. The protocol eliminates useless checkpoints and selects for recovery only those processes that are dependent within the checkpointing interval concerned. A process starts recovery from its last checkpoint only if it finds that it depends, directly or indirectly, on the faulty process. The recovery protocol also prevents missing and orphan messages. In AdhocNet, a set of three mutually connected nodes forms a TMR set, with the nodes designated main, primary and secondary. A main node in one set may serve as primary or secondary in another. Computation is not triplicated; instead, the checkpoint taken by main is duplicated in its primary so that primary can continue if main fails. The checkpoint taken by primary is in turn duplicated in secondary, so that secondary can continue if primary fails too.
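The rule that a process recovers only if it depends, directly or indirectly, on the faulty process can be illustrated with a small reachability sketch. This is a minimal illustration under stated assumptions, not the protocol from the paper: the `depends_on` map (receiver to the set of senders it heard from during the interval) and the function name `processes_to_roll_back` are hypothetical.

```python
from collections import deque


def processes_to_roll_back(faulty, depends_on):
    """Return the faulty process plus every process that depends on it,
    directly or indirectly, within the current checkpointing interval.

    depends_on: dict mapping a process id to the set of process ids whose
    messages it received in the interval (assumed representation).
    """
    # Invert the map so we can walk from a sender to its receivers.
    dependents = {}
    for receiver, senders in depends_on.items():
        for sender in senders:
            dependents.setdefault(sender, set()).add(receiver)

    # Breadth-first search from the faulty process over "who depends on me".
    to_roll_back = {faulty}
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for proc in dependents.get(current, ()):
            if proc not in to_roll_back:
                to_roll_back.add(proc)
                queue.append(proc)
    return to_roll_back


# P1 received from P0, P2 received from P1, P3 is independent.
deps = {"P0": set(), "P1": {"P0"}, "P2": {"P1"}, "P3": set()}
result = processes_to_roll_back("P0", deps)
```

In this example P1 depends on P0 directly and P2 depends on it indirectly, so both would roll back, while P3 keeps running from its current state.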
Highlights
Distributed systems that execute processes on different nodes connected by a communication network [6] are prone to failure
The concept of Triple Modular Redundancy (TMR) is used in this work to achieve fault tolerance in a wireless adhoc network (AdhocNet), where a group of three nodes, known as mobile hosts (MHs), forms the three replicas
The checkpointing algorithm proposed in this paper constructs consistent checkpoints in a distributed manner
Summary
Distributed systems that execute processes on different nodes connected by a communication network [6] are prone to failure. The concept of Triple Modular Redundancy (TMR) is used in this work to achieve fault tolerance in a wireless adhoc network (AdhocNet), where a group of three nodes, known as mobile hosts (MHs), forms the three replicas. Fault tolerance may be achieved by periodically using the stable storage of the MHs to save the processes' states, better known as checkpoints, during failure-free execution. Processes take local checkpoints after being notified by the initiator, except in special cases described in later sections. The processes synchronize their activities of the current checkpointing interval before committing their checkpoints. The paper argues that any global checkpoint taken in this fashion is consistent, that the scheme avoids taking unnecessary checkpoints, and that the system has to roll back only to the last saved state in case of a failure.
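The notify, checkpoint, synchronize and commit sequence summarized above resembles a generic two-phase coordinated checkpoint. The sketch below shows that generic pattern only and may differ from the paper's protocol: the `Process` class, the `checkpoint_round` function, the `reachable` set and the use of an integer as process state are all assumptions made for illustration.

```python
class Process:
    """Toy process whose state is a single integer (assumption)."""

    def __init__(self, pid):
        self.pid = pid
        self.state = 0
        self.tentative = None   # checkpoint taken but not yet committed
        self.committed = None   # stand-in for a checkpoint on stable storage

    def take_tentative(self):
        # Phase 1: save the state locally and acknowledge the initiator.
        self.tentative = self.state
        return True

    def commit(self):
        # Phase 2: make the tentative checkpoint the recovery point.
        self.committed = self.tentative
        self.tentative = None

    def discard(self):
        self.tentative = None

    def roll_back(self):
        # Recovery: return to the last committed checkpoint.
        self.state = self.committed


def checkpoint_round(processes, reachable):
    """Initiator-driven round: commit only if every process acknowledged."""
    acks = [p.take_tentative() for p in processes if p.pid in reachable]
    if len(acks) == len(processes) and all(acks):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.discard()
    return False


procs = [Process("P0"), Process("P1"), Process("P2")]
for p in procs:
    p.state = 5
first_ok = checkpoint_round(procs, reachable={"P0", "P1", "P2"})

# A later round in which P2 cannot be reached is abandoned.
for p in procs:
    p.state = 7
second_ok = checkpoint_round(procs, reachable={"P0", "P1"})

# After a failure, P0 rolls back to the last committed checkpoint.
procs[0].roll_back()
```

Because the second round is abandoned, every process keeps the checkpoint from the first round, so a rollback returns to one saved state per process rather than to a mix of rounds.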