Abstract

This thesis examines reliability and fault tolerance in Distributed Shared Memory (DSM) multiprocessors, and the performance impact of fault tolerance protocols that provide Backward Error Recovery through synchronized checkpointing. High-performance parallel computing systems that implement DSM require interconnection networks capable of providing low latency, high bandwidth, and efficient support for multicast and synchronization operations. Software-based DSM systems rely on the operating system to manage replicated memory pages, and consequently their performance suffers from operating system overhead, false sharing, and page thrashing. To obtain high levels of performance, the activities that maintain the consistency of shared data in a DSM should be implemented in hardware so that data access latencies are minimized. The recoverable DSM system examined in this thesis is intended for the class of broadcast-based interconnection networks, in order to provide the low latencies required by the application workloads characteristic of DSM. An example of this class of interconnection network is the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus). The unique architecture of the SOME-Bus tightly integrates the transmitter, receiver, and cache controller hardware to produce a system-wide coherence mechanism. This thesis presents four protocols for fault-tolerant DSM and uses simulation and theoretical analysis to examine their performance on the SOME-Bus multiprocessor. The proposed protocols exploit the data distribution operations that already occur as part of managing shared data in DSMs in order to hide the overhead of fault tolerance. The increased availability of shared data maintained for fault tolerance can also enhance the performance of the DSM by increasing the likelihood that a request for data can be filled locally, without communication with remote nodes.
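
The idea of Backward Error Recovery through synchronized checkpointing can be illustrated with a minimal sketch. It is not one of the four protocols presented in the thesis: the `Node` class, the `synchronized_checkpoint` and `recover` helpers, and the dictionary used as a memory image are all hypothetical names introduced here for illustration, and the barrier that would align the nodes in a real system is assumed rather than modeled.

```python
import copy


class Node:
    """Hypothetical DSM node with a local memory image (illustration only)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = {}        # current state: address -> value
        self._checkpoint = {}   # last saved recovery point

    def write(self, addr, value):
        self.memory[addr] = value

    def save_checkpoint(self):
        # Keep an independent copy so later writes cannot alter the recovery point.
        self._checkpoint = copy.deepcopy(self.memory)

    def rollback(self):
        # Backward error recovery: restore the last saved state.
        self.memory = copy.deepcopy(self._checkpoint)


def synchronized_checkpoint(nodes):
    # Assumption: every node has reached the same logical point (e.g. a barrier),
    # so the saved states together form one consistent global recovery line.
    for node in nodes:
        node.save_checkpoint()


def recover(nodes):
    # After a detected fault, every node returns to the common recovery line.
    for node in nodes:
        node.rollback()


nodes = [Node(i) for i in range(3)]
nodes[0].write("x", 1)
synchronized_checkpoint(nodes)   # recovery line: node 0 holds x = 1
nodes[0].write("x", 2)           # work performed after the checkpoint
nodes[1].write("y", 7)
recover(nodes)                   # fault detected: discard post-checkpoint work
```

After `recover`, node 0 again holds only `x = 1` and node 1 is empty, because all nodes roll back together. The sketch copies the full memory image at every checkpoint, which is the kind of overhead the proposed protocols aim to hide by reusing data transfers the DSM already performs.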