Abstract

With the emergence of accelerators like GPUs, MICs, and FPGAs, the availability of domain-specific libraries (like MKL), and the ease of parallelization associated with CUDA- and OpenMP-based shared-memory programming, node-based parallelization has recently become a popular choice among developers in the field of scientific computing. This is evident from the large volume of recently published work in various domains of scientific computing where shared-memory programming and accelerators have been used to accelerate applications. Although these approaches are suitable for small problem sizes, several issues must be addressed before they can be applied to larger input domains. First, the primary focus of these works has been to accelerate the core kernel; acceleration of input/output operations is seldom considered. Many operations in scientific computing operate on large matrices, both sparse and dense, that are read from and written to external files. These input/output operations become bottlenecks and significantly affect the overall application time. Second, node-based parallelization prevents a developer from distributing the computation beyond a single node without learning an additional programming paradigm like MPI. Third, the problem size that a node can effectively handle is limited by the memory of the node and its accelerator. In this paper, an Asynchronous Multi-node Execution (AMNE) approach is presented that uses a unique combination of the shared file system and pseudo-replication to extend node-based algorithms to a distributed, multi-node implementation with minimal changes to the original node-based code. We demonstrate this approach by applying it to GEMM, a popular kernel in dense linear algebra, and show that the presented methodology significantly advances the state of the art in parallelization and scientific computing.

Highlights

  • Many applications in engineering and scientific computing involve operations on dense and sparse matrices [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]

  • This paper presents Asynchronous Multi-node Execution (AMNE), which attempts to address the aforementioned challenges associated with parallelizing a class of applications known as embarrassingly parallel applications [23]

  • Single-node programming paradigms like OpenMP, CUDA, and OpenACC have recently gained popularity among researchers in engineering and scientific computing, and optimized libraries such as Intel’s MKL, NVIDIA’s cuBLAS, and PLASMA are available that have been tuned for single-node execution


Summary

INTRODUCTION

Many applications in engineering and scientific computing involve operations on dense and sparse matrices [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. When large dense matrices are involved, the GEMM (general matrix-matrix multiplication) operation, in addition to being computationally expensive, becomes I/O intensive as well, and making the operation holistically efficient becomes more challenging. A single dense matrix of size 50k × 50k storing double-precision floating-point elements requires approximately 18.6 GB of RAM. This makes the GEMM operation on such large matrices non-trivial to implement on a node with 32 GB of memory or a high-end GPU with 16 GB of device memory. One way to address this problem is to distribute the computation across multiple nodes. This, however, requires the developer to port the existing single-node code to a distributed architecture using a distributed-memory paradigm like MPI.
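The memory figure above, and the row-wise slicing idea that motivates AMNE's out-of-memory slicing, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `gemm_slice` and `sliced_gemm` are hypothetical, and a real multi-node run would write each slice's result to the shared file system rather than stacking results in one process.

```python
import numpy as np

# Footprint of one dense 50k x 50k matrix of 8-byte doubles,
# matching the ~18.6 GB figure quoted above.
n = 50_000
bytes_needed = n * n * 8          # 20e9 bytes
gib = bytes_needed / 2**30        # ~18.63 GiB

def gemm_slice(a_slice, b):
    # Work assigned to one node: multiply a horizontal slice of A
    # by the full B, producing the corresponding slice of C.
    return a_slice @ b

def sliced_gemm(a, b, n_slices):
    # Split the row indices of A into n_slices contiguous chunks;
    # no single chunk requires holding all of A or C in memory.
    row_chunks = np.array_split(np.arange(a.shape[0]), n_slices)
    return np.vstack([gemm_slice(a[rows], b) for rows in row_chunks])
```

Because each slice's product is independent of the others, the slices can be computed asynchronously on separate nodes, which is what makes this class of problems embarrassingly parallel.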

MOTIVATION
ASYNCHRONOUS MULTI-NODE EXECUTION
Out-of-Memory Slicing
Code Modification
Job Script Preparation
Pseudo-Replication
Experimental Testbed
Benchmarks
Slicing
Launcher and Job Scripts
Results and Analysis
Comparison with other Approaches
RELATED WORK
CONCLUSION