SpMV Computations Research Articles

Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, the irregular memory access patterns, extensive memory usage, high bandwidth requirements, and underutilization of parallelism hinder the computational efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. Our methodology involves partitioning the original matrix into uniformly sized blocks, with each block’s size determined by considering architectural characteristics and accuracy requirements. Additionally, we dynamically assign precision to each block using a precision selection method that takes into account the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (Precision-based partitioning) and BDMP-TCKI (Tailored compression and kernel implementation). BDMP-PBP partitions the matrix into two independent matrices for separate computations based on block precision, offering flexibility for integration with other optimization techniques. Meanwhile, BDMP-TCKI focuses on achieving significant thread-level parallelism and memory utilization by tailoring an appropriate compressed storage format and kernel implementation for each block. We compare BDMP with NVIDIA’s cuSPARSE library and three state-of-the-art SpMV methods, including SELLP, MergeBase, and BalanceCSR, using matrices from the University of Florida’s SuiteSparse dataset collection. 
BDMP-PBP and BDMP-TCKI show average speedups up to 2.64× and 2.91× on Turing RTX 2080Ti, and up to 2.99× and 3.22× on Ampere A100. The results demonstrate that BDMP enables the optimization of computation speed without compromising the precision necessary for reliable results.
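
The block partitioning and per-block precision assignment described in this abstract can be illustrated with a small CPU sketch. It is not the authors' BDMP implementation: the function names, the `block_size` and `rel_tol` parameters, and the selection rule (keep a block in single precision only if every value survives rounding to float32 within `rel_tol`) are all assumptions made for illustration, and float32 storage is simulated with the standard `struct` module rather than run on a GPU.

```python
import struct
from collections import defaultdict


def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def partition_blocks(entries, block_size):
    """Group (row, col, value) triples into uniform square blocks."""
    blocks = defaultdict(list)
    for r, c, v in entries:
        blocks[(r // block_size, c // block_size)].append((r, c, v))
    return dict(blocks)


def choose_precision(block, rel_tol=1e-9):
    """Hypothetical rule: fp32 only if no value loses more than rel_tol."""
    for _, _, v in block:
        try:
            err = abs(to_f32(v) - v)
        except OverflowError:  # value does not fit in float32 at all
            return "fp64"
        if err > rel_tol * abs(v):
            return "fp64"
    return "fp32"


def mixed_precision_spmv(entries, x, n_rows, block_size=2, rel_tol=1e-9):
    """Compute y = A @ x, storing each block in its selected precision."""
    y = [0.0] * n_rows
    precisions = {}
    for key, block in partition_blocks(entries, block_size).items():
        prec = choose_precision(block, rel_tol)
        precisions[key] = prec
        for r, c, v in block:
            stored = to_f32(v) if prec == "fp32" else v
            y[r] += stored * x[c]
    return y, precisions


# A 4x4 matrix in coordinate form with two populated 2x2 blocks.
entries = [(0, 0, 1.0), (0, 1, 0.5), (1, 1, 2.0),
           (2, 2, 0.1), (3, 3, 0.3), (3, 2, 1e-3)]
x = [1.0, 2.0, 3.0, 4.0]
y, precisions = mixed_precision_spmv(entries, x, n_rows=4)
print(precisions)
print(y)
```

In this example the top-left block holds values that float32 represents exactly, so the assumed rule keeps it in single precision, while the bottom-right block holds values such as 0.1 that lose digits when rounded and therefore stays in double precision. A real implementation would pick the block size from the GPU architecture and the accuracy target, as the abstract notes.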


The Sparse Matrix-Vector Multiplication (SpMV) kernel is used in a broad class of linear algebra computations. SpMV computations result in a performance bottleneck in many high performance applications, so optimizing SpMV performance is paramount. While implementing this kernel on a GPU can potentially boost performance significantly, current GPU libraries either provide modest performance gains or are burdened with high sparse format conversion overhead. In this paper we introduce the Vertical Compressed Sparse Row (VCSR) format, a novel memory-aware format that outperforms previously proposed formats on a GPU. We first motivate the design of our baseline VCSR format and then step through a series of enhancements that further improve VCSR's memory efficiency (VCSR-MEM) and performance (VCSR-INTRLV), while also considering conversion overhead. VCSR attempts to produce a high degree of thread-level parallelism and memory utilization by exploiting knowledge of GPU memory microarchitecture. VCSR can reduce the number of global memory transactions significantly, an issue not addressed by most other sparse formats. In addition, VCSR provides a novel reordering mechanism. It minimizes the size of the compressed matrix, handles both regular and irregular sparse matrices, and can be customized based on matrix size. VCSR also minimizes conversion overhead, as compared to full or partial row reordering. Our methodology is highly configurable and can be optimized for any sparse matrix. We have evaluated the VCSR format for the SpMV kernel when run on two different NVIDIA GPUs, the Kepler K40 and the Volta V100. We compare VCSR with NVIDIA's cuSPARSE library (the HYB format), a state-of-the-art sparse library. We also compare against other state-of-the-art CSR-based formats, including CSR5, merge-base SpMV and HOLA. We evaluate the benefits of VCSR over the entire University of Florida's SuiteSparse dataset collection.
The VCSR-baseline format achieves an average speedup ranging from 1.10× to 1.39× when compared to the performance of the four state-of-the-art formats on an NVIDIA V100. While the VCSR-MEM format can save a significant amount of memory space, it is a bit slower than our VCSR-baseline. VCSR-INTRLV performs much better than the VCSR-baseline, and even when including the conversion overhead, achieves an average speedup of 1.08× as compared to HOLA (the best performing format among the prior schemes).
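
For readers unfamiliar with the format family being compared here, the following is a minimal sketch of SpMV over plain Compressed Sparse Row (CSR) storage, the starting point that CSR-based formats build on. It is standard CSR, not VCSR: the abstract does not specify VCSR's layout or reordering, so none of that is reproduced. The names `csr_spmv` and `row_lengths` are illustrative.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (matching nonzero values).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is gathered through col_idx, which is the source of the
            # irregular memory accesses that sparse formats try to tame.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def row_lengths(row_ptr):
    """Number of nonzeros per row; uneven lengths unbalance GPU threads."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


# The 3x3 matrix [[10, 0, 0], [0, 20, 30], [0, 0, 0]] in CSR form.
row_ptr = [0, 1, 3, 3]
col_idx = [0, 1, 2]
values = [10.0, 20.0, 30.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
lengths = row_lengths(row_ptr)
print(y)        # [10.0, 130.0, 0.0]
print(lengths)  # [1, 2, 0]
```

On a GPU, a common mapping assigns one thread or one warp to each row, so the uneven row lengths shown by `row_lengths` translate directly into idle threads and scattered reads of `x`.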


Articles published on SpMV Computations

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs

VCSR: An Efficient GPU Memory-Aware Sparse Format

AAQAL: A Machine Learning-Based Tool for Performance Optimization of Parallel SPMV Computations Using Block CSR

Optimization of Sparse Distributed Computations

SparseP

DIESEL: A novel deep learning-based tool for SpMV computations and solving sparse linear equation systems

Iteratively solving sparse linear system based on PaRSEC task scheduling

ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines

Optimization techniques for sparse matrix–vector multiplication on GPUs

Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU

Yet another Hybrid Strategy for Auto-tuning SpMV on GPUs

New Sparse Matrix Storage Format to Improve The Performance of Total SPMV Time
