BLAS Kernels Research Articles

Krylov subspace iterative solvers are often the method of choice for solving large sparse linear systems. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating-point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, because these libraries usually comprise a well-optimized but limited set of linear algebra operations, applications that use them often fail to reduce certain data communications, and hence fail to leverage the full potential of the accelerator. In this paper, we target the acceleration of Krylov subspace iterative methods for GPUs, and in particular the Biconjugate Gradient Stabilized solver. We show that significant improvement can be achieved by reformulating the method to reduce data communication through application-specific kernels instead of the generic BLAS kernels, e.g. as provided by NVIDIA’s cuBLAS library, and by designing a GPU-specific sparse matrix-vector product kernel that uses the GPU’s computing power more efficiently. Furthermore, we derive a model estimating the performance improvement and use experimental data to validate the expected runtime savings. Considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing the algorithm structure, as well as the sparse matrix-vector product, are crucial for the subsequent development of high-performance GPU-accelerated Krylov subspace iterative methods.
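The data-communication argument in this abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' GPU implementation: the function names and the `traffic` counter are hypothetical, and the counter is only a crude stand-in for vector elements moved to or from memory. The sketch compares a vector update followed by a separate dot product (as with two generic BLAS calls) against a single fused pass of the kind an application-specific kernel might perform.

```python
def unfused_update_and_norm(r, v, alpha):
    """Two separate passes, mimicking an AXPY-style call followed by a DOT call."""
    traffic = 0                      # hypothetical count of element reads/writes
    s = [0.0] * len(r)
    for i in range(len(r)):          # pass 1: s = r - alpha * v
        s[i] = r[i] - alpha * v[i]
        traffic += 3                 # read r[i], read v[i], write s[i]
    norm_sq = 0.0
    for i in range(len(s)):          # pass 2: norm_sq = s . s
        norm_sq += s[i] * s[i]
        traffic += 1                 # s[i] has to be read back again
    return s, norm_sq, traffic


def fused_update_and_norm(r, v, alpha):
    """One pass that updates s and accumulates s . s at the same time."""
    traffic = 0
    s = [0.0] * len(r)
    norm_sq = 0.0
    for i in range(len(r)):
        si = r[i] - alpha * v[i]
        s[i] = si
        norm_sq += si * si           # reuse si before it leaves the "register"
        traffic += 3                 # read r[i], read v[i], write s[i]
    return s, norm_sq, traffic


r = [1.0, 2.0, 3.0, 4.0]
v = [0.5, 0.5, 0.5, 0.5]
s_a, n_a, t_a = unfused_update_and_norm(r, v, 2.0)
s_b, n_b, t_b = fused_update_and_norm(r, v, 2.0)
print(n_a, n_b, t_a, t_b)
```

Both variants return the same vector and the same squared norm. Only the counted memory traffic differs, which is the kind of saving the abstract attributes to replacing generic kernels with fused, application-specific ones.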

The Dynamically Partitioned Data-Flow (DPDF) model is based on an original approach to analyzing the data dependency graph at the instruction level. Instead of a breadth-first analysis, as in a classical data-flow model, instructions are executed along data-dependent paths. As a consequence, data locality can be exploited by reusing results between consecutive instructions. In addition, the paths are not statically defined but arise from a dynamic partitioning of the graph. The model has the advantage of supporting very low-cost dynamic scheduling and multitasking strategies. To study the efficiency of this new model, a first architecture has been defined. This architecture is currently limited to a single processor with one serial processing unit but four graph-analyzing units (called prefetch units). Each prefetch unit can dynamically build its own execution path inside the data-flow graph of an application. The efficiency of this architecture is studied on a numerical benchmark composed of a subset of the Livermore loops and three Level 3 BLAS routines (GEMM, SYRK and TRSM). Our goal in these experiments is to demonstrate the ability of the four prefetch units to feed the ALU.

Articles published on BLAS Kernels

Hierarchical approach for deriving a reproducible unblocked LU factorization

Optimized sparse Cholesky factorization on hybrid multicore architectures

Reordering Strategy for Blocking Optimization in Sparse Linear Solvers

Acceleration of GPU-based Krylov solvers via data transfer reduction

A BLAS-3 Version of the QR Factorization with Column Pivoting

A block variant of the GMRES method on massively parallel processors

A block variant of the GMRES method for unsymmetric linear systems

Performance of level 3 BLAS kernels in a dynamically partitioned data-flow environment

Highly nonnormal eigenproblems in the aeronautical industry

Comparisons of Gaussian elimination algorithms on a Cray Y-MP

Block-Cholesky for parallel processing

Use of Level 3 BLAS in LU Factorization in a Multiprocessing Environment on Three Vector Multiprocessors: the Alliant FX/80, the Cray-2, and the IBM 3090 VF

Level 3 BLAS in LU Factorization on the Cray-2, ETA-10P, and IBM 3090-200/VF
