Abstract

With the emergence of accelerators like GPUs, MICs, and FPGAs, the availability of domain-specific libraries (like MKL), and the ease of parallelization associated with CUDA- and OpenMP-based shared-memory programming, node-based parallelization has recently become a popular choice among developers in the field of scientific computing. This is evident from the large volume of recently published work in various domains of scientific computing where shared-memory programming and accelerators have been used to accelerate applications. Although these approaches are suitable for small problem sizes, several issues must be addressed before they can be applied to larger input domains. First, the primary focus of these works has been to accelerate the core kernel; acceleration of input/output operations is seldom considered. Many operations in scientific computing operate on large matrices, both sparse and dense, that are read from and written to external files. These input/output operations become bottlenecks and significantly affect the overall application time. Second, node-based parallelization prevents a developer from distributing the computation beyond a single node without learning an additional programming paradigm like MPI. Third, the problem size that a node can effectively handle is limited by the memory of the node and its accelerator. In this paper, an Asynchronous Multi-node Execution (AMNE) approach is presented that uses a unique combination of the shared file system and pseudo-replication to extend node-based algorithms to a distributed, multi-node implementation with minimal changes to the original node-based code. We demonstrate this approach by applying it to GEMM, a popular kernel in dense linear algebra, and show that the presented methodology significantly advances the state of the art in parallelization and scientific computing.

Highlights

  • Many applications in engineering and scientific computing involve operations on dense and sparse matrices [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]

  • This paper presents Asynchronous Multi-node Execution (AMNE), which attempts to address the aforementioned challenges associated with parallelizing a class of applications known as embarrassingly parallel applications [23]

  • Single-node programming paradigms like OpenMP, CUDA, and OpenACC have recently gained popularity among researchers in engineering and scientific computing, and optimized libraries such as Intel’s MKL, NVIDIA’s cuBLAS, and PLASMA are available that have been tuned for single-node execution


Summary

INTRODUCTION

Many applications in engineering and scientific computing involve operations on dense and sparse matrices [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. When large dense matrices are involved, the GEMM (general matrix-matrix multiplication) operation, in addition to being computationally expensive, becomes I/O intensive as well, and making the operation holistically efficient becomes more challenging. A single dense matrix of size 50k × 50k storing double-precision floating-point elements requires approximately 18.6 GB of RAM. This makes the GEMM operation on such large matrices non-trivial to implement on a node with 32 GB of memory or a high-end GPU with 16 GB of device memory. One way to address this problem is to distribute the computation across multiple nodes. This, however, requires the developer to port the existing single-node code to a distributed architecture using a distributed-memory paradigm like MPI.
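The memory figure above, and the row-wise slicing idea that motivates AMNE's out-of-memory slicing, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `gemm_slice` and `sliced_gemm` are hypothetical, and a real multi-node run would write each slice's result to the shared file system rather than stacking results in one process.

```python
import numpy as np

# Footprint of one dense 50k x 50k matrix of 8-byte doubles,
# matching the ~18.6 GB figure quoted above.
n = 50_000
bytes_needed = n * n * 8          # 20e9 bytes
gib = bytes_needed / 2**30        # ~18.63 GiB

def gemm_slice(a_slice, b):
    # Work assigned to one node: multiply a horizontal slice of A
    # by the full B, producing the corresponding slice of C.
    return a_slice @ b

def sliced_gemm(a, b, n_slices):
    # Split the row indices of A into n_slices contiguous chunks;
    # no single chunk requires holding all of A or C in memory.
    row_chunks = np.array_split(np.arange(a.shape[0]), n_slices)
    return np.vstack([gemm_slice(a[rows], b) for rows in row_chunks])
```

Because each slice's product is independent of the others, the slices can be computed asynchronously on separate nodes, which is what makes this class of problems embarrassingly parallel.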

MOTIVATION
ASYNCHRONOUS MULTI-NODE EXECUTION
Out-of-Memory Slicing
Code Modification
Job Script Preparation
Pseudo-Replication
Experimental Testbed
Benchmarks
Slicing
Launcher and Job Scripts
Results and Analysis
Comparison with other Approaches
RELATED WORK
CONCLUSION