Exploring the Use of Novel Spatial Accelerators in Scientific Applications

Rizwan A Ashraf,Roberto Gioiosa

doi:10.1145/3489525.3511690

Abstract

Driven by the need to find alternative accelerators which can viably replace graphics processing units (GPUs) in next-generation Supercomputing systems, this paper proposes a methodology to enable agile application/hardware co-design. The application-first methodology provides the ability to come up with design of accelerators while working with real-world workloads, available accelerators, and system software. The iterative design process targets a set of kernels in a workload for performance estimates that can prune the design space for later phases of detailed architectural evaluations. To this effect, in this paper, a novel data-parallel device model is introduced that simulates the latency of performance-sensitive operations in an accelerator including data transfers and kernel computation using multi-core CPUs. The use of off-the-shelf simulators, such as pre-RTL simulator Aladdin or multiple tools available for exploring the design of deep neural network accelerators (e.g., Timeloop) is demonstrated for evaluation of various accelerator designs using applications with realistic inputs. Examples of multiple device configurations that are instantiable in a system are explored to evaluate the performance benefit of deploying novel accelerators. The proposed device is integrated with a programming model and system software to potentially explore the impacts of high-level programming languages/compilers and low-level effects such as task scheduling on multiple accelerators. We analyze our methodology for a set of applications that represent high-performance computing (HPC) and graph analytics. The applications include a computational chemistry kernel realized using tensor contractions, triangle counting, GraphSAGE and Breadth-first Search. These applications include kernels such as dense matrix-dense matrix multiplication, sparse matrix-spare matrix multiplication, and sparse matrix-dense vector multiplication. Our results indicate potential performance benefits and insights for system design by including accelerators that realize these kernels along-side general purpose accelerators.

Full Text