Abstract

Sparse matrix–vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to speed in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule the segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming segments of equal size (each containing rows with approximately equal npr) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group's segments on different streams; hence, the number of threads used to execute each segment is chosen adaptively. Dynamic parallelism available in Nvidia GPUs is utilized to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. SURAA therefore minimizes the adverse effects of npr variance by distributing the load uniformly across equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open-source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools, delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
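As a rough illustration of the setup phase described in the abstract, the following Python sketch sorts rows by npr and cuts them into equal-sized segments, with the segment count derived from the Freedman–Diaconis bin width. This is not the paper's implementation: the helper names, the cap on the segment count, and the exact mapping from bin width to number of segments are our assumptions for illustration only.

```python
import numpy as np

def fd_num_segments(npr):
    """Derive a segment count from the Freedman-Diaconis bin width
    2 * IQR * n^(-1/3). Assumed mapping, for illustration only."""
    q75, q25 = np.percentile(npr, [75, 25])
    width = 2.0 * (q75 - q25) / np.cbrt(len(npr))
    if width <= 0:
        return 1
    k = int(np.ceil((npr.max() - npr.min()) / width))
    return max(1, min(k, len(npr)))  # at least 1 segment, at most one per row

def build_segments(npr):
    """Sort rows by nonzeros per row and split into equal-sized segments."""
    order = np.argsort(npr, kind="stable")  # row indices in ascending npr
    return np.array_split(order, fd_num_segments(npr))

# Toy npr distribution with high variance: many short rows, a few long ones.
npr = np.array([1, 1, 2, 2, 3, 3, 4, 50, 60, 400])
segments = build_segments(npr)
```

In SURAA's terms, the resulting segments would then be assembled into three groups by their mean npr and dispatched to the scalar, vector, or dynamic parallel kernel; that grouping step is omitted here.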

Highlights

  • Sparse linear algebra is vital to scientific computations and various fields of engineering, and has been included among the seven dwarfs [1] by the Berkeley researchers

  • The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row and forming segments of equal size using the Freedman–Diaconis rule [54,55]

  • To the best of our knowledge, no other work uses the Freedman–Diaconis rule for sparse matrix–vector (SpMV) computations in the way we do, or uses adaptive kernels the way we do in this paper


Summary

Introduction

Sparse linear algebra is vital to scientific computations and various fields of engineering, and has been included among the seven dwarfs [1] by the Berkeley researchers. Specialized storage structures are used to improve the performance of SpMV. These structures raise design issues when translated to GPUs. The major issues include coalesced memory access to both the sparse matrix A and the vector y, load balance among threads and warps, thread divergence within a warp of threads, performance variance depending on the structure of the sparse matrix, and the amount of memory access required for the computations. Dynamic parallelism, available in Nvidia GPUs of compute capability 3.5 and above (NVIDIA Corporation, Santa Clara, CA, USA), is utilized to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and more efficient SpMV computations on GPUs. Note that the row segments in SURAA are scheduled dynamically by the dynamic kernel and executed concurrently using the hardware streams on the GPU.
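The load-balance and divergence issues above are easiest to see against a plain CSR SpMV reference. The sketch below is a sequential Python version, not the paper's GPU code: in a GPU "scalar" kernel, one thread executes the inner loop for one row, so each thread's work is proportional to that row's npr, and high npr variance leaves threads handling short rows idle while threads handling long rows keep running.

```python
import numpy as np

def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference CSR SpMV: y[i] = sum of A[i, j] * x[j] over row i's
    nonzeros. The inner loop length is exactly row i's npr, which is
    the quantity whose variance SURAA's segmentation balances."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[1, 0, 2], [0, 3, 0], [4, 5, 6]] stored in CSR form
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = csr_spmv(row_ptr, col_idx, vals, np.ones(3))
```

A "vector" kernel instead assigns a warp per row (better for long rows, wasteful for short ones), which is why SURAA routes segments to different kernel types based on their mean npr.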

Sparse Storage Schemes
Dynamic Parallelism
Literature Survey
SURAA: The Proposed Method and Tool
Setup Phase
Dynamic Parallel Kernel
Scalar Kernel
Vector Kernel
Section Summary and Clarifications
Results and Analysis
Experimental Testbed
Benchmark Suite
SpMV Performance
Throughput and Speedup
Aggregate Throughput and Speedup
Effective Memory Bandwidth
SURAA: Comparative Performance against High npr Variance
SURAA: Parametric Configuration
Preprocessing Cost
Sparse Iterative Solver Performance
Conclusions
