Abstract

In this dissertation, we explore multiple designs for a Distributed Transactional Memory framework for GPU clusters. Using Transactional Memory, we relieve the programmer of many concerns, including 1) how to move data between many discrete memory spaces; 2) how to ensure data correctness when shared objects may be accessed by multiple devices; 3) how to prevent catastrophic warp divergence caused by atomic operations; 4) how to prevent catastrophic warp divergence caused by long-latency off-device communication; and 5) how to ensure Atomicity, Consistency, Isolation, and Durability for programs with irregular memory accesses. Each of these concerns can individually be daunting to programmers who lack expert knowledge of the GPU's architectural quirks, including its SIMD execution model, weak memory model, and lack of direct access to a NIC. The goal of this work is to significantly reduce the programming effort required to realize performant GPU applications despite workload characteristics that are unfavorable to the underlying architecture. Using our automatic concurrency control system, CUDA-DTM, programmers can convert some traditional applications into GPU applications in an afternoon, where doing so by hand would otherwise take months of development and debugging.

We analyze the performance and workload flexibility of CUDA-DTM, the first Distributed Transactional Memory framework written in CUDA for large-scale GPU clusters. Transactional Memory has become an attractive concurrency control scheme for GPU applications with irregular memory access patterns because it avoids serializing threads while maintaining programmability and preventing deadlocks. CUDA-DTM extends an existing GPU Software Transactional Memory model to allow individual threads across many GPUs to initiate accesses to a Partitioned Global Address Space, using a proposed GPU-to-GPU communication scheme built on CUDA-Aware MPI. CUDA-DTM allows programmers to treat individual GPU threads as though they were as flexible and independent as CPU threads, relying on a run-time system that automatically resolves conflicts, prevents warp divergence, moves data between host and device memory spaces, and preserves data integrity.

While CUDA-DTM ensures that applications run correctly to completion without deadlocks or livelocks, there is no free lunch: programmers must still understand the underlying architecture and the locality of PGAS memory accesses to achieve the "100x" speedups so enticingly advertised in the existing GPU literature. In fact, the best-performing implementation for a given workload would likely require replacing each of the CUDA-DTM components with an application-specific design. However, using CUDA-DTM, programmers can rapidly test the suitability of GPU clusters for their workloads.
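As background for the GPU-to-GPU communication style the abstract refers to, the sketch below shows generic CUDA-Aware MPI usage: a device pointer is passed directly to MPI, so data moves between GPUs without an explicit staging copy through host memory. This is a minimal illustration of the mechanism under the assumption of a CUDA-aware MPI build (for example, Open MPI compiled with CUDA support); it is not code from the CUDA-DTM framework itself, and the buffer size and two-rank message flow are arbitrary choices for the example.

```c
/* Minimal CUDA-Aware MPI sketch: device memory is handed directly to
 * MPI_Send/MPI_Recv. Requires an MPI library built with CUDA support.
 * Illustrative only; not the dissertation's actual implementation. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;          /* example message size */
    float *d_buf;                   /* buffer resident in GPU memory */
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        /* ...fill d_buf with a kernel, then send straight from device memory */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive directly into the remote GPU's device memory */
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```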
