Fault-tolerant distributed shared memory on a broadcast-based interconnection architecture

Diana Lynn Hecht

doi:10.17918/etd-48

Abstract

This thesis focuses on reliability and fault tolerance in Distributed Shared Memory (DSM) multiprocessors, and on the performance impact of implementing fault tolerance protocols that allow for Backward Error Recovery through the use of synchronized checkpointing. High-performance parallel computing systems that implement DSM require interconnection networks capable of providing low latency, high bandwidth, and efficient support for multicast and synchronization operations. Software-based DSM systems rely on the operating system to manage the replicated memory pages, and consequently their performance suffers due to operating system overhead, false sharing, and page thrashing. To obtain high levels of performance, the activities related to maintaining the consistency of shared data in a DSM should be implemented in hardware so that data access latencies can be minimized. The recoverable DSM system examined in this thesis is intended for the class of broadcast-based interconnection networks in order to provide the low latencies required for the application workloads characteristic of DSM. An example of this class of interconnection network is the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus). The unique architecture of the SOME-Bus provides for strong integration of the transmitter, receiver, and cache controller hardware to produce a highly integrated system-wide coherence mechanism. This thesis presents four protocols for fault-tolerant DSM and uses simulation and theoretical analysis to examine the performance of the protocols on the SOME-Bus multiprocessor. The proposed fault tolerance protocols exploit the inherent data distribution operations that occur as part of the management of shared data in DSMs to hide the overhead of fault tolerance.
The increased availability of shared data that results from supporting fault tolerance can also be used to enhance the performance of the DSM by increasing the likelihood that a request for data can be satisfied locally without requiring communication with remote nodes.
