Abstract

This article addresses fault tolerance and error recovery in a parallel graph reduction computer such as the "MaRS" machine presently under development at CERT. MaRS is a multiprocessor system with decentralized control and asynchronous, delayed communications between cooperating, tightly coupled processes. A solution to the problem of MaRS error recovery is derived, based on the machine's execution model (successive reductions performed on the program graph, i.e. evaluations of the functional expression to be computed) and on its architectural organization (a number of reduction units and memory units interconnected by a message-switching network). Under the basic assumption that the errors generated by faults in Reduction and Communication Processors can be detected and confined so as to avoid system contamination, it is shown that a coherent and error-free recovery state can be restored. Although specifically developed for the MaRS machine, this solution is in principle applicable to other machines using the graph reduction model.

