Basic Linear Algebra Subprograms Routines Research Articles

We present an approach for integrating the time evolution of quantum systems. We leverage the computation power of graphics processing units (GPUs) to perform the integration of all time steps in parallel. The performance boost is especially prominent for small to medium-sized quantum systems. The devised algorithm can largely be implemented using the recently-specified batched versions of the BLAS routines, and can therefore be easily ported to a variety of platforms. Our PARAllelized Matrix Exponentiation for Numerical Time evolution (PARAMENT) implementation runs on CUDA-enabled graphics processing units.

Program summary
Program Title: PARAMENT
CPC Library link to program files: https://doi.org/10.17632/zy5v4xs89d.1
Developer's repository link: https://github.com/parament-integrator/parament
Licensing provisions: Apache 2.0
Programming language: C / CUDA / Python
Nature of problem: Time-integration of the Schrödinger equation with a time-dependent Hamiltonian for quantum systems with a small Hilbert space but many time-steps.
Solution method: A 4th order Magnus integrator, highly parallelized on a GPU, implemented using a small subset of BLAS functions for improved portability.
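The idea of treating all time steps in parallel can be illustrated with a small sketch. It is not the PARAMENT implementation: it assumes a piecewise-constant Hamiltonian sampled once per step (lower order than the 4th order Magnus integrator named in the summary), a truncated Taylor series for the matrix exponential, and plain Python lists in place of batched BLAS calls on a GPU. All function names (`expm_taylor`, `step_propagators`, `ordered_product`) are hypothetical. The point is only the structure: every step propagator is computed independently, and the results are then combined by a pairwise, order-preserving product.

```python
import math


def matmul(a, b):
    """Dense product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 + 0j if i == j else 0j for j in range(n)] for i in range(n)]


def expm_taylor(a, terms=20):
    """Truncated Taylor series for exp(a); adequate only when a is small."""
    n = len(a)
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def step_propagators(hamiltonians, dt):
    """U_k = exp(-i H_k dt) for every step.

    No step depends on another, so this loop is the part that a batched
    routine could evaluate for all steps at once.
    """
    return [expm_taylor([[-1j * dt * x for x in row] for row in h])
            for h in hamiltonians]


def ordered_product(props):
    """Pairwise reduction to U_{n-1} ... U_1 U_0 (later steps on the left).

    Assumes a non-empty list of propagators.
    """
    while len(props) > 1:
        reduced = [matmul(props[i + 1], props[i])
                   for i in range(0, len(props) - 1, 2)]
        if len(props) % 2:
            reduced.append(props[-1])  # odd element carried to the next round
        props = reduced
    return props[0]


# Usage: a two-level system driven by a constant H = omega * sigma_x.
sigma_x = [[0.0, 1.0], [1.0, 0.0]]
omega, dt, n_steps = 0.7, 0.01, 100
hamiltonians = [[[omega * x for x in row] for row in sigma_x]
                for _ in range(n_steps)]
u_total = ordered_product(step_propagators(hamiltonians, dt))
theta = omega * dt * n_steps  # total rotation angle
```

For this constant Hamiltonian the exact propagator is cos(theta) I - i sin(theta) sigma_x, which gives a convenient check on `u_total`. Each round of the pairwise reduction is again a set of independent matrix products, which is why it maps naturally onto batched matrix multiplication.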

Programmers usually implement iterative methods that solve partial differential equations by expressing them as a sequence of basic kernels from libraries optimized for the graphics processing unit (GPU). The global runtime of the resulting combination is often penalized by the smallest and most inefficient vector operations. To improve GPU utilization, we identify and analyze the potential kernels to be fused according to the data dependence, data type and size, and GPU resources. This paper provides an extensive analysis of the impact of fusing vector operations [level 1 of Basic Linear Algebra Subprograms (BLAS)] on the performance of the GPU. The experimental evaluation shows that this optimization provides noticeable improvement, especially for kernels with lower memory requirements and on more modern GPUs. Fused BLAS operations can help programmers efficiently code iterative methods that solve large linear systems of equations on the GPU. Iterative methods such as the biconjugate gradient method (BCG) are one example that can benefit from this optimization strategy. Indeed, kernel fusion of vector routines makes the most efficient GPU implementation of BCG run between 1.09× and 1.27× faster on three GPUs of different characteristics.
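What fusing two level 1 operations means can be shown with a minimal sketch. The pairing of an AXPY update with a dot product is an assumption chosen for illustration, not necessarily one of the fusions evaluated in the paper, and `fused_axpy_dot` is a hypothetical name. Plain Python stands in for GPU kernels here, so the sketch shows only the change in data traversal, not the performance effect.

```python
def axpy(alpha, x, y):
    """Level 1 AXPY: returns alpha * x + y (first pass over the data)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(x, y):
    """Level 1 DOT: returns sum(x_i * y_i) (second pass over the data)."""
    return sum(xi * yi for xi, yi in zip(x, y))


def fused_axpy_dot(alpha, x, y, z):
    """Hypothetical fused routine: computes y_new = alpha * x + y and
    dot(y_new, z) in a single pass, so each y_new element is used for the
    reduction while it is still at hand instead of being re-read later."""
    y_new = []
    acc = 0.0
    for xi, yi, zi in zip(x, y, z):
        v = alpha * xi + yi
        y_new.append(v)
        acc += v * zi
    return y_new, acc


# Usage: the fused routine must agree with the two separate calls.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
z = [1.0, 0.0, -1.0]
unfused_y = axpy(2.0, x, y)
unfused_dot = dot(unfused_y, z)
fused_y, fused_dot = fused_axpy_dot(2.0, x, y, z)
```

On a GPU the unfused version would correspond to two kernel launches with the intermediate vector written to and read back from global memory, whereas the fused version needs one launch and keeps the intermediate value in registers. Whether a given pair can be fused depends on the data dependence between the two operations, as the abstract notes.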

Articles published on Basic Linear Algebra Subprograms Routines

Quantitative Performance Analysis of BLAS Libraries on GPU Architectures

Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines

Integration and exploitation of intra-routine malleability in BLIS

Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study

High performance BLAS formulation of the adaptive Fast Multipole Method
