GPU Implementation Research Articles

We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is publicly available on GitHub under a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization which makes use of compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs, and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable-sized batched dense linear algebra operations such as GEMM, TRSM and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner run on NVIDIA, AMD and Intel GPUs. Interfaces are available from PETSc, Trilinos and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF. For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster than the CPU-only solver. The BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver. For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA’s cuDSS solver.
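
As a rough illustration of why low rank compression of a block can reduce both memory and work, the sketch below stores an m-by-n block as two thin factors U (m-by-r) and V (n-by-r) and applies it to a vector without forming the dense block. This is not STRUMPACK's API or the authors' code; the function names `low_rank_matvec` and `storage_ratio` are hypothetical, and the factors are assumed to be given rather than computed by a compression routine.

```python
def low_rank_matvec(U, V, x):
    """Compute (U @ V^T) @ x using only the thin factors.

    U is m-by-r and V is n-by-r, both as lists of rows (assumed given).
    The cost is about r*(m+n) multiply-adds instead of m*n for a dense block.
    """
    r = len(V[0])
    # First contract with V: t = V^T x has only r entries.
    t = [sum(V[j][p] * x[j] for j in range(len(x))) for p in range(r)]
    # Then expand with U: y = U t has m entries.
    return [sum(U[i][p] * t[p] for p in range(r)) for i in range(len(U))]


def storage_ratio(m, n, r):
    """Entries stored for the two factors divided by entries of the dense block."""
    return (r * (m + n)) / (m * n)
```

Under these assumptions, a 256-by-256 block of rank 8 needs 1/16 of the dense storage, which is the kind of saving a block low rank format relies on when off-diagonal blocks have small numerical rank.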

Massive MIMO (Multiple Input Multiple Output) systems impose significant processing burdens along with strict latency requirements. The combination of large-scale antenna arrays and wide bandwidth requirements for next-generation wireless systems creates an exponential increase in frontend to backend data. Balancing processing latency and reliability is critical for baseband processing tasks such as QAM detection. While linear detection algorithms have low computational complexity, their error performance degrades heavily in Massive MIMO scenarios. Nonlinear detection methods such as Maximum Likelihood (ML) and Sphere Decoding have good error performance, but they suffer from high, variable, and uncontrollable computational complexity. For such cases, the K-best QAM detection algorithm can provide the required control over system performance while maintaining near-ML error performance. In this paper, hard-output as well as soft-output K-best QAM detection is implemented on a CPU by utilizing multiple cores combined with vector processing. Similarly, hard-output detection is implemented on a GPU by leveraging the SIMD (Single Instruction, Multiple Data) architecture and warp-based execution model. The processing time per bit and the energy consumption per bit are compared between the CPU and GPU implementations across QAM constellation densities and MIMO array sizes. The GPU implementation shows up to 5× improvement in processing latency per bit and up to 120× improvement in energy consumption per bit over the CPU implementation for typical QAM constellations such as 4, 16, and 64 QAM. The GPU implementation also shows up to 125× improvement over the CPU implementation in energy consumption per bit for larger MIMO configurations such as 24 × 24 and 32 × 32. Finally, the soft-output detector is combined with an LDPC (Low-Density Parity Check) decoder to obtain the FER (Frame Error Rate) performance for the CPU implementation. The FER is then combined with frame processing latency to form a Goodput metric to demonstrate the latency and reliability tradeoff.
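
To make the idea of a controllable, breadth-first search concrete, here is a minimal hard-output K-best sketch in plain Python. It is not the paper's CPU or GPU implementation: it assumes the channel has already been reduced to an upper-triangular system y ≈ R s (for example by a QR decomposition done elsewhere), and the function name `k_best_detect` and its arguments are hypothetical.

```python
def k_best_detect(R, y, constellation, k):
    """Breadth-first K-best search on an upper-triangular system y ≈ R s.

    R: n-by-n upper-triangular matrix as a list of rows (assumed precomputed).
    y: received vector of length n after the same transformation.
    constellation: candidate symbol values per layer.
    k: number of partial candidates kept at each layer.
    Returns (metric, symbols) for the best surviving candidate.
    """
    n = len(y)
    # Each survivor is (accumulated squared distance, symbols for decided layers).
    survivors = [(0.0, [])]
    for i in range(n - 1, -1, -1):  # detect from the last layer upward
        candidates = []
        for metric, tail in survivors:
            # tail[0] is the symbol of layer i+1, tail[1] of layer i+2, and so on.
            interference = sum(R[i][j] * tail[j - i - 1] for j in range(i + 1, n))
            for s in constellation:
                residual = y[i] - interference - R[i][i] * s
                candidates.append((metric + abs(residual) ** 2, [s] + tail))
        # Keep only the k smallest partial metrics; this bounds the work per layer.
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:k]
    return survivors[0]
```

Because each layer expands at most k survivors by the constellation size, the work per layer is fixed in advance, which is the property that distinguishes K-best from the variable complexity of sphere decoding. Setting k to the full tree width recovers an exhaustive search in this sketch.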

Articles published on GPU Implementation

Performance Study of an MRI Motion-Compensated Reconstruction Program on Intel CPUs, AMD EPYC CPUs, and NVIDIA GPUs

VAN-DAMME: GPU-accelerated and symmetry-assisted quantum optimal control of multi-qubit systems

Dimensionality Reduction for the Real-Time Light-Field View Synthesis of Kernel-Based Models

A parallel recursive framework for modelling time series

A flexible and fast digital twin for RRAM systems applied for training resilient neural networks

A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression

An implementation of GPU accelerated mapreduce: using hadoop with openCL for breast cancer detection and compute-intensive jobs

Are 2D shallow-water solvers fast enough for early flood warning? A comparative assessment on the 2021 Ahr valley flood event

A Novel Low-Complexity and Parallel Algorithm for DCT IV Transform and Its GPU Implementation

PriorMSM: An Efficient Acceleration Architecture for Multi-Scalar Multiplication

Single neuromorphic memristor closely emulates multiple synaptic mechanisms for energy efficient neural networks

Projective Peridynamic Modeling of Hyperelastic Membranes With Contact

IKPLS: Improved Kernel Partial Least Squares and Fast Cross-Validation Algorithms for Python with CPU and GPU Implementations Using NumPy and JAX

Enhancing SILCS-MC via GPU Acceleration and Ligand Conformational Optimization with Genetic and Parallel Tempering Algorithms

GPU implementation of the Frenet Path Planner for embedded autonomous systems: A case study in the F1tenth scenario

Parallel Implementation of K-Best Quadrature Amplitude Modulation Detection for Massive Multiple Input Multiple Output Systems

A Scalable Accelerator for Local Score Computation of Structure Learning in Bayesian Networks

Robust and effective ab initio molecular dynamics simulations on the GPU cloud infrastructure using the Schrödinger Materials Science Suite

Design and Hardware Implementation of CNN-GCN Model for Skeleton-Based Human Action Recognition

Exact Analytical Algorithm for the Solvent-Accessible Surface Area and Derivatives in Implicit Solvent Molecular Simulations on GPUs
