Linear Algebra Computations Research Articles

Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, the irregular memory access patterns, extensive memory usage, high bandwidth requirements, and underutilization of parallelism hinder the computational efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. Our methodology involves partitioning the original matrix into uniformly sized blocks, with the block size determined by architectural characteristics and accuracy requirements. Additionally, we dynamically assign a precision to each block using a precision selection method that takes into account the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (precision-based partitioning) and BDMP-TCKI (tailored compression and kernel implementation). BDMP-PBP partitions the matrix into two independent matrices for separate computations based on block precision, offering flexibility for integration with other optimization techniques. Meanwhile, BDMP-TCKI focuses on achieving significant thread-level parallelism and memory utilization by tailoring an appropriate compressed storage format and kernel implementation for each block. We compare BDMP with NVIDIA’s cuSPARSE library and three state-of-the-art SpMV methods (SELLP, MergeBase, and BalanceCSR), using matrices from the University of Florida’s SuiteSparse dataset collection.
BDMP-PBP and BDMP-TCKI show average speedups up to 2.64× and 2.91× on Turing RTX 2080Ti, and up to 2.99× and 3.22× on Ampere A100. The results demonstrate that BDMP enables the optimization of computation speed without compromising the precision necessary for reliable results.
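The precision-based partitioning idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the precision selection rule, so the rule here (a block is stored in single precision only if every value survives rounding to float32 within a relative tolerance) is an assumption, and the names `choose_block_precision`, `split_by_precision`, `spmv_mixed`, and `rel_tol` are hypothetical. Python floats are doubles, so reduced precision is emulated only in the stored values, not in the arithmetic.

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def choose_block_precision(values, rel_tol=1e-12):
    """Assumed rule (not from the paper): 'single' only if every value
    in the block is reproduced by float32 within rel_tol."""
    for v in values:
        try:
            rounded = to_f32(v)
        except OverflowError:      # value outside the float32 range
            return "double"
        if abs(rounded - v) > rel_tol * abs(v):
            return "double"
    return "single"

def split_by_precision(entries, block_size, rel_tol=1e-12):
    """Split (row, col, value) entries into a single-precision matrix and a
    double-precision matrix, deciding per square block of side block_size."""
    blocks = {}
    for r, c, v in entries:
        blocks.setdefault((r // block_size, c // block_size), []).append((r, c, v))
    low, high = [], []
    for block in blocks.values():
        if choose_block_precision([v for _, _, v in block], rel_tol) == "single":
            low.extend((r, c, to_f32(v)) for r, c, v in block)
        else:
            high.extend(block)
    return low, high

def spmv_coo(entries, x, n_rows):
    """Reference y = A x for a matrix given as (row, col, value) entries."""
    y = [0.0] * n_rows
    for r, c, v in entries:
        y[r] += v * x[c]
    return y

def spmv_mixed(entries, x, n_rows, block_size, rel_tol=1e-12):
    """Compute the two partial products separately and add them."""
    low, high = split_by_precision(entries, block_size, rel_tol)
    y_low = spmv_coo(low, x, n_rows)
    y_high = spmv_coo(high, x, n_rows)
    return [a + b for a, b in zip(y_low, y_high)]
```

Because the two partial matrices are independent, each could in principle be handed to a different kernel or storage format, which is the flexibility the abstract attributes to BDMP-PBP.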

The Sparse Matrix-Vector Multiplication (SpMV) kernel is used in a broad class of linear algebra computations. SpMV computations result in a performance bottleneck in many high performance applications, so optimizing SpMV performance is paramount. While implementing this kernel on a GPU can potentially boost performance significantly, current GPU libraries either provide modest performance gains or are burdened with high sparse format conversion overhead. In this paper we introduce the Vertical Compressed Sparse Row (VCSR) format, a novel memory-aware format that outperforms previously proposed formats on a GPU. We first motivate the design of our baseline VCSR format and then step through a series of enhancements that further improve VCSR's memory efficiency (VCSR-MEM) and performance (VCSR-INTRLV), while also considering conversion overhead. VCSR attempts to produce a high degree of thread-level parallelism and memory utilization by exploiting knowledge of GPU memory microarchitecture. VCSR can reduce the number of global memory transactions significantly, an issue not addressed by most other sparse formats. In addition, VCSR provides a novel reordering mechanism. It minimizes the size of the compressed matrix, handles both regular and irregular sparse matrices, and can be customized based on matrix size. VCSR also minimizes conversion overhead, as compared to full or partial row reordering. Our methodology is highly configurable and can be optimized for any sparse matrix. We have evaluated the VCSR format for the SpMV kernel when run on two different NVIDIA GPUs, the Kepler K40 and the Volta V100. We compare VCSR with NVIDIA's cuSPARSE library (the HYB format), a state-of-the-art sparse library. We also compare against other state-of-the-art CSR-based formats, including CSR5, merge-based SpMV and HOLA. We evaluate the benefits of VCSR over the entire University of Florida's SuiteSparse dataset collection.
The VCSR-baseline format achieves an average speedup ranging from 1.10× to 1.39× when compared to the performance of the four state-of-the-art formats on an NVIDIA V100. While the VCSR-MEM format can save a significant amount of memory space, it is a bit slower than our VCSR-baseline. VCSR-INTRLV performs much better than the VCSR-baseline, and even when including the conversion overhead, achieves an average speedup of 1.08× as compared to HOLA (the best performing format among the prior schemes).
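For orientation, the kernel that these formats accelerate can be sketched in its plain Compressed Sparse Row (CSR) form. This is the standard textbook CSR layout, not the VCSR format itself, whose layout the abstract does not specify; the helper names `coo_to_csr` and `spmv_csr` are hypothetical.

```python
def coo_to_csr(entries, n_rows):
    """Convert (row, col, value) entries to CSR arrays (row_ptr, col_idx, vals)."""
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in entries:            # count nonzeros per row
        row_ptr[r + 1] += 1
    for i in range(n_rows):            # prefix sum gives row start offsets
        row_ptr[i + 1] += row_ptr[i]
    col_idx = [0] * len(entries)
    vals = [0.0] * len(entries)
    next_slot = row_ptr[:-1]           # slicing copies; next free slot per row
    for r, c, v in entries:
        k = next_slot[r]
        col_idx[k] = c
        vals[k] = v
        next_slot[r] += 1
    return row_ptr, col_idx, vals

def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A x, one output element per row."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]   # indirect, irregular read of x
        y.append(acc)
    return y
```

The indirect read `x[col_idx[k]]` and the varying row lengths are the source of the irregular memory access and load imbalance that both abstracts describe.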

Articles published on Linear Algebra Computations

Novel progressive deep learning algorithm for uncovering multiple single nucleotide polymorphism interactions to predict paclitaxel clearance in patients with nonsmall cell lung cancer.

A High-Speed Floating Point Matrix Multiplier Implemented in Reconfigurable Architecture

Accurate Computations with Block Checkerboard Pattern Matrices

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Givens rotations for QR decomposition, SVD and PCA over database joins

Addition and intersection of linear time-invariant behaviors

Improvement of variables interpretability in kernel PCA

Task-based Parallel Programming for Scalable Matrix Product Algorithms

Authenticated key agreement scheme for IoT networks exploiting lightweight linear algebraic computations

Spectral Ranking in Complex Networks Using Memristor Crossbars

A GPU parallel randomized CUR compression method for the Method of Moments

VCSR: An Efficient GPU Memory-Aware Sparse Format

Energy Efficient Approximate 3D Image Reconstruction

The Linear Algebra Mapping Problem. Current State of Linear Algebra Languages and Libraries

A survey on machine learning in array databases

A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level

Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms

Accelerating approximate matrix multiplication for near-sparse matrices on GPUs

A Parallel Approach of the Enhanced Craig–Bampton Method
