Abstract

We consider a parallel implementation of matrix-vector multiplication (GEMV, a Level 2 BLAS operation) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations with multiple-precision vectors and matrices consist of several parts, each of which is calculated by a separate CUDA kernel. This feature eliminates branch divergence when performing the sequential parts of multiple-precision operations and allows full utilization of the GPU’s resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to the GPU global memory. We have performed a rounding error analysis and derived error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared to existing high-precision packages deployed on GPUs.
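
As a rough illustration of such a data structure (the field names below are assumptions made for this sketch, not the exact layout used in the paper), a vector of n multiple-precision numbers can be stored in "struct of arrays" form, so that all RNS digits of the vector sit in one contiguous array:

```cuda
// Sketch of a "struct of arrays" layout for n multiple-precision numbers.
// Field names and the interval-evaluation fields are illustrative assumptions.
#define RNS_MODULI_SIZE 8            // assumed number of RNS moduli

typedef struct {
    int    *digits;    // n * RNS_MODULI_SIZE residues (RNS digits), stored contiguously
    int    *sign;      // n signs
    int    *exp;       // n exponents
    double *eval_low;  // n lower bounds of the interval evaluation of the fraction
    double *eval_up;   // n upper bounds of the interval evaluation of the fraction
} mp_array_t;          // illustrative name
```

With this layout, consecutive threads of a digit-parallel kernel read consecutive entries of digits, which keeps accesses to global memory coalesced.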

Highlights

  • A separate Compute Unified Device Architecture (CUDA) kernel performs each part of a multiple-precision operation with its own launch configuration; all digits of multiple-precision numbers are calculated in parallel (see the sketch after this list)

  • Our experiments show that, in many cases, MPRES-BLAS performs better than implementations based on existing high-precision packages for central processing units (CPUs) and GPUs

  • We have presented a parallel implementation of the multiple-precision GEMV operation for systems with CUDA-compatible GPUs
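
To make the first highlight concrete, here is a minimal sketch, under assumed kernel names and a simplified number format, of how one element-wise multiple-precision multiplication could be split into two CUDA kernels: a digit-parallel kernel with one thread per RNS residue and a per-element kernel for signs and exponents, each launched with its own configuration. This is an illustration of the idea, not the actual MPRES-BLAS kernels.

```cuda
// Illustrative decomposition of an element-wise multiple-precision multiply
// into separate kernels (simplified; overflow handling and the interval
// evaluation of fractions are omitted).
#define RNS_MODULI_SIZE 8   // assumed number of RNS moduli

// Part 1: digit-parallel kernel, one thread per RNS residue.
// Consecutive threads touch consecutive words of the digits arrays,
// so global memory accesses are coalesced and there is no branch divergence.
__global__ void mp_mul_digits(int *rd, const int *ad, const int *bd,
                              const int *moduli, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global residue index
    if (i < n * RNS_MODULI_SIZE) {
        int m = moduli[i % RNS_MODULI_SIZE];
        rd[i] = (ad[i] * bd[i]) % m;
    }
}

// Part 2: per-element kernel for the "sequential" parts (signs, exponents),
// launched with its own grid/block configuration.
__global__ void mp_mul_sign_exp(int *rs, int *re,
                                const int *as, const int *ae,
                                const int *bs, const int *be, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // element index
    if (i < n) {
        rs[i] = as[i] ^ bs[i];   // sign of the product
        re[i] = ae[i] + be[i];   // exponent of the product
    }
}
```

On the host side, each kernel would be launched with a grid sized to its own amount of parallelism, e.g. n * RNS_MODULI_SIZE threads for the digit kernel and n threads for the sign/exponent kernel.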

Summary

Introduction

Floating-point operations incur rounding errors directly during calculations. In our implementation, a separate CUDA kernel performs each part of a multiple-precision operation with its own configuration, and all digits of multiple-precision numbers are calculated in parallel. Although this approach increases the number of global memory accesses, it provides high performance and good scalability of high-precision computations on GPUs compared to the traditional paradigm, in which each multiple-precision arithmetic operation is performed by a single thread. To implement this approach, we use the residue number system (RNS) [12]. Conclusions and directions for further research are presented in the last section of the paper.
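
As a brief reminder of why RNS enables this digit-level parallelism, the host-side sketch below (with moduli chosen only for illustration) shows that with pairwise coprime moduli, multiplication acts independently on each residue, so all digits can be processed by parallel threads without carry propagation.

```cuda
// Host-side illustration of the residue number system (illustrative moduli).
// A non-negative integer x < M, where M = 13 * 15 * 16 * 17, is represented
// by its residues x mod m[i]; arithmetic is carried out digit by digit.
#define MODULI_COUNT 4
static const long MODULI[MODULI_COUNT] = {13, 15, 16, 17};   // pairwise coprime

// Convert a non-negative integer to its RNS representation.
void int_to_rns(long residues[MODULI_COUNT], long x) {
    for (int i = 0; i < MODULI_COUNT; i++)
        residues[i] = x % MODULI[i];
}

// Multiply two RNS numbers: each digit is independent of the others,
// which is what allows all digits to be computed in parallel on the GPU.
void rns_mul(long r[MODULI_COUNT], const long a[MODULI_COUNT],
             const long b[MODULI_COUNT]) {
    for (int i = 0; i < MODULI_COUNT; i++)
        r[i] = (a[i] * b[i]) % MODULI[i];
}
```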

  • High-Precision Computations and BLAS for GPU
  • Representation of arbitrary-length floating-point numbers using RNS
  • Data layout
  • Algorithms for implementing GEMV on GPUs
      • The case of a non-transposed matrix
      • The case of a transposed matrix
  • Accuracy evaluation
  • Performance results
      • Performance of individual CUDA kernels
      • Comparison with other implementations
  • Findings
  • Conclusion