A fault-tolerant hierarchical diagnostic network for massively parallel processing systems

Yoon-Hwa Choi, Yu-Seok Kim

doi:10.1016/s0045-7906(98)00007-x

Abstract

Massively parallel processing systems use a large number of processing nodes to provide high performance, primarily for data-intensive applications. In a system of such dimensions, high availability cannot be achieved without relying on redundancy and reconfiguration. An important aspect of highly available design is rapid diagnosis and graceful degradation in the event of failures. This paper presents a hierarchical diagnostic network for locating faults in parallel processor systems composed of a large number of identical processing nodes. In the case of a single fault, the network can locate the fault at the time it is detected. Even in the case of multiple faults, it can significantly reduce the test time compared with the well-known binary search. Unlike existing self-diagnostic circuits, the diagnostic network requires little hardware overhead and may tolerate a fault in the network itself.

Full Text