Compressed Sparse Row Format Research Articles

Sparse matrix-vector multiplication (SpMV) is central to many scientific and engineering applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms has been widely studied, and the memory access behavior of the GPU is often the performance bottleneck. NVIDIA's recent Ampere GPU architecture provides a new asynchronous memory copy instruction, memcpy_async, for more efficient data movement into shared memory. Leveraging this new instruction, we first propose CSR-Partial-Overlap, which carefully overlaps the data copy from global memory to shared memory with computation, allowing us to take full advantage of the data transfer time. In addition, we design a dynamic batch partition and a dynamic thread distribution to achieve effective load balancing, avoid the overhead of fixing up partial sums, and improve thread utilization. Furthermore, we propose CSR-Full-Overlap, based on CSR-Partial-Overlap, which also overlaps the data transfer from host to device with SpMV kernel execution. CSR-Full-Overlap unifies the two major overlaps in SpMV and hides as much computation as possible behind these two data movements, which allows it to achieve the performance gains of both overlaps. As far as we know, this paper is the first in-depth study of how memcpy_async can be applied to accelerate SpMV computation on GPU platforms. We compare CSR-Full-Overlap with the current state-of-the-art cuSPARSE library; our experimental results show an average performance gain of 2.03x and a maximum of 2.67x.
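For readers unfamiliar with the format, the following is a minimal sequential sketch of SpMV over CSR arrays. It is not the CSR-Partial-Overlap or CSR-Full-Overlap method described above, and it does not use memcpy_async; it only illustrates the baseline computation those methods accelerate. The names `csr_spmv`, `row_ptr`, `col_idx`, and `values` are illustrative choices, not taken from the paper.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix (3 x 3, four nonzeros):
# A = [[10,  0,  0],
#      [ 0,  0, 20],
#      [30, 40,  0]]
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 1]
values = [10.0, 20.0, 30.0, 40.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)  # [10.0, 60.0, 110.0]
```

The inner loop reads `x` at indices given by `col_idx`, so accesses to `x` are irregular even though `values` and `col_idx` are read contiguously. This irregularity is one reason memory access tends to dominate SpMV cost.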

Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, the relatively small memory of a GPU cannot accommodate a large dose-deposition coefficient (DDC) matrix, which arises in cases with, e.g., a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present the detailed techniques employed for the GPU implementation. The authors also use this problem as an example to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on the CPU in coordinate list (COO) format. On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of the beamlet price, the first step in PP, is performed on multiple GPUs. A fast inter-GPU data transfer scheme is implemented using peer-to-peer access. The remaining steps of the PP and MP problems are implemented on the CPU or a single GPU due to their modest problem scale and computational loads. The Barzilai-Borwein algorithm with a subspace step scheme is adopted to solve the MP problem.
A head and neck (H&N) cancer case is then used to validate the authors' method. The authors also compare their multi-GPU implementation with three different single-GPU implementation strategies, i.e., truncating the DDC matrix (S1), repeatedly transferring the DDC matrix between CPU and GPU (S2), and porting computations involving the DDC matrix to the CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases are used to demonstrate the advantages of the authors' method. The authors' multi-GPU implementation can finish the optimization process within ∼1 min for the H&N patient case. S1 leads to an inferior plan quality, although its total time was 10 s shorter than that of the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23-46 s. In contrast, to obtain clinically comparable or acceptable plans for all six VMAT cases tested in this paper, the optimization time needed in a commercial treatment planning system (TPS) on CPU was found to be on the order of several minutes. The results demonstrate that the multi-GPU implementation of the authors' column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors' study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
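The storage step described above (a COO matrix split into submatrices, each kept in CSR) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the split is by contiguous column blocks, standing in for groups of beamlets that belong to one range of beam angles, and it runs on the CPU only. The function names `coo_to_csr` and `split_coo_by_column_block` are hypothetical, and the example uses two parts rather than four to keep it short.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets to CSR arrays (row_ptr, col_idx, values)."""
    # Count nonzeros per row, then prefix-sum into row pointers.
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]

    col_idx = [0] * len(vals)
    values = [0.0] * len(vals)
    next_slot = row_ptr[:-1]  # slicing copies; next free slot in each row
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        col_idx[k] = c
        values[k] = v
        next_slot[r] += 1
    return row_ptr, col_idx, values


def split_coo_by_column_block(rows, cols, vals, n_cols, n_parts):
    """Split COO triplets into n_parts contiguous column blocks.

    Assumption: each block stands in for one group of beam angles.
    Column indices are renumbered to be local to their block.
    """
    block = -(-n_cols // n_parts)  # ceiling division
    parts = [([], [], []) for _ in range(n_parts)]
    for r, c, v in zip(rows, cols, vals):
        p = c // block
        parts[p][0].append(r)
        parts[p][1].append(c - p * block)
        parts[p][2].append(v)
    return parts


# Example matrix (2 x 4):
# A = [[1, 0, 2, 0],
#      [0, 3, 0, 4]]
rows = [0, 0, 1, 1]
cols = [0, 2, 1, 3]
vals = [1.0, 2.0, 3.0, 4.0]

full_csr = coo_to_csr(2, rows, cols, vals)
parts = split_coo_by_column_block(rows, cols, vals, n_cols=4, n_parts=2)
sub_csr = [coo_to_csr(2, r, c, v) for r, c, v in parts]
```

Each entry of `sub_csr` is an independent CSR matrix with the same number of rows as the original, so each could be placed on a separate device and multiplied independently.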

Articles published on Compressed Sparse Row Format

SpEpistasis: A sparse approach for three-way epistasis detection

Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs

Sparse matrix‐vector and matrix‐multivector products for the truncated SVD on graphics processors

Algebraic Multigrid Using a Stencil–CSR Hybrid Format on GPUs

Time domain boundary element method for semi-infinite domain problems using CSR storage method

A Ternary Neural Network with Compressed Quantized Weight Matrix for Low Power Embedded Systems

A Pattern-Based SpGEMM Library for Multi-Core and Many-Core Architectures

SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Improving performance of iterative solvers with the AXC format using the Intel Xeon Phi

GPU-accelerated element-free reverse-time migration with Gauss points partition

GPU Implementation of Image Convolution Using Sparse Model with Efficient Storage Format

Domain decomposition parallel computing for transient two-phase flow of nuclear reactors

A hybrid format for better performance of sparse matrix-vector multiplication on a GPU

Parallel Approaches and Technologies of Domain Decomposition Methods

Multi-GPU implementation of a VMAT treatment plan optimization algorithm

The EIT Forward Problem Parallelized Using a Colored pJDS Matrix Format

Solving large tomographic linear systems: size reduction and error estimation

Solving the Examination Timetabling Problem in GPUs

Optimization of quasi-diagonal matrix–vector multiplication on GPU

Optimization of the ILU(0) factorization algorithm with the use of compressed sparse row format
