Fine-grained Parallelism Research Articles

Sparse Triangular Solve (SpTRSV) has long been an essential kernel in the field of scientific computing. Due to its low computational intensity and internal data dependencies, SpTRSV is hard to implement and optimize on GPUs. Based on our experimental observations, existing implementations on GPUs fail to achieve optimal performance due to their sub-optimal parallelism setups and code implementations, and their lack of consideration for irregular data distribution. Moreover, their algorithm design lacks adaptability to different input matrices, so maintaining consistent performance may require substantial manual effort in algorithm redesign and parameter tuning. In this work, we propose AG-SpTRSV, an automatic framework to optimize SpTRSV on GPUs, which provides high performance on various matrices while eliminating the costs of manual design. AG-SpTRSV abstracts the procedures of optimizing an SpTRSV kernel as a scheme and constructs a comprehensive optimization space based on it. By defining a unified code template and preparing code variants, AG-SpTRSV enables fine-grained dynamic parallelism and adaptive code optimizations to handle various tasks. Through computation graph transformation and multi-hierarchy heuristic scheduling, AG-SpTRSV generates schemes for task partitioning and mapping, which effectively address the issues of irregular data distribution and internal data dependencies. AG-SpTRSV searches for the best scheme to optimize the target kernel for the specific matrix. A learned lightweight performance model is also introduced to reduce search costs. Experimental results with the SuiteSparse Matrix Collection on NVIDIA Tesla A100 and RTX 3080 Ti show that AG-SpTRSV outperforms state-of-the-art implementations with geometric average speedups of 2.12x to 3.99x. With the performance model enabled, AG-SpTRSV can provide an efficient end-to-end solution, with preprocessing times ranging from 3.4 to 245 times the execution time.
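The internal data dependencies mentioned in the abstract can be illustrated with a small sketch. It is not AG-SpTRSV itself: it is a plain CPU version of level-set scheduling, a commonly used baseline for SpTRSV, and it assumes a lower-triangular matrix stored in CSR form with a nonzero diagonal. The function names `level_sets` and `sptrsv_lower` and the sample matrix are made up for illustration.

```python
def level_sets(indptr, indices):
    """Group rows of a lower-triangular CSR matrix into dependency levels.

    Row i depends on every row j < i that appears as a column index in row i.
    Rows in the same level do not depend on each other.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    groups = {}
    for i, lvl in enumerate(level):
        groups.setdefault(lvl, []).append(i)
    return [groups[lvl] for lvl in sorted(groups)]


def sptrsv_lower(indptr, indices, data, b):
    """Solve L x = b by forward substitution, one dependency level at a time."""
    n = len(indptr) - 1
    x = [0.0] * n
    for rows in level_sets(indptr, indices):
        # Rows inside one level are independent; a GPU kernel could
        # process them in parallel. Here they run sequentially.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    acc -= data[k] * x[j]
            x[i] = acc / diag
    return x


# Hypothetical 4x4 example:
# L = [[2, 0, 0, 0],
#      [0, 4, 0, 0],
#      [1, 0, 1, 0],
#      [0, 2, 3, 5]]
indptr = [0, 1, 2, 4, 7]
indices = [0, 1, 0, 2, 1, 2, 3]
data = [2.0, 4.0, 1.0, 1.0, 2.0, 3.0, 5.0]
b = [2.0, 8.0, 4.0, 33.0]

levels = level_sets(indptr, indices)
x = sptrsv_lower(indptr, indices, data, b)
```

In this example rows 0 and 1 form the first level, row 2 the second, and row 3 the third, so only the first level offers any parallelism. Uneven level sizes of this kind are one form of the irregular data distribution the abstract refers to.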

Processors with hundreds of threads of execution are among the state of the art in high-end computing systems. This transition to many-core computing has required the community to develop new algorithms to overcome significant latency bottlenecks through massive concurrency. However, implementing efficient parallel runtimes that can scale up to high concurrency levels with extremely fine-grained tasks remains a challenge. Existing techniques do not scale to a large number of threads due to the high cost of synchronization in concurrent data structures. We present a thorough analysis of various synchronization mechanisms, including mutex, semaphore, spinlock and atomic fetch-and-add, that are typically used to build concurrent data structures in task-parallel runtime systems. To overcome these limitations, in a recent work we proposed XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. In this work, we extend XQueue and present X-OpenMP, a library for enabling extremely fine-grained parallelism on modern many-core systems with hundreds of cores. Work stealing is a popular choice for load balancing in task-based runtime systems as it efficiently distributes the load across worker threads; however, traditional approaches rely on synchronization primitives, and thus work stealing can incur overheads. Here we implement a lock-less algorithm for work stealing for total-store order (TSO) memory architectures and evaluate the performance using micro and macro benchmarks. We compare the performance of X-OpenMP with native LLVM OpenMP, GNU OpenMP, OpenCilk and oneTBB implementations using task-based linear algebra routines from the PLASMA numerical library, Strassen’s matrix multiplication from the BOTS Benchmark Suite, and the Unbalanced Tree Search benchmark. Applications parallelized using OpenMP can run without modification by simply linking against the X-OpenMP library. X-OpenMP achieves up to 40X speedup compared to GNU OpenMP, up to 2X speedup compared to the native LLVM OpenMP, up to 6X speedup compared to OpenCilk and up to 5X speedup compared to oneTBB implementations. The tasking overheads in X-OpenMP are reduced by 50% compared to the native LLVM OpenMP.
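The work-stealing policy described in the abstract can be sketched as a small sequential simulation. This is only an illustration of the scheduling idea, not the lock-less algorithm used by X-OpenMP: it runs on a single thread, and it assumes the common convention that an owner takes tasks from one end of its own queue while an idle worker steals from the opposite end of the fullest queue. The function name `run_work_stealing` and the sample workload are hypothetical.

```python
from collections import deque


def run_work_stealing(queues):
    """Simulate work stealing round by round.

    `queues` holds one initial task list per worker. Returns, for each
    worker, the tasks it ended up executing, in execution order.
    """
    deques = [deque(q) for q in queues]
    executed = [[] for _ in deques]
    while any(deques):
        for w, dq in enumerate(deques):
            if dq:
                # The owner takes the most recently added task.
                executed[w].append(dq.pop())
            else:
                # An idle worker picks the fullest queue as its victim
                # and steals the oldest task from the opposite end.
                victim = max(range(len(deques)), key=lambda v: len(deques[v]))
                if deques[victim]:
                    executed[w].append(deques[victim].popleft())
    return executed


# Hypothetical workload: worker 0 starts with six tasks, worker 1 with none.
result = run_work_stealing([[1, 2, 3, 4, 5, 6], []])
```

Here the idle worker ends up executing half of the tasks, which is the load-balancing effect the abstract attributes to work stealing. In a real multithreaded runtime the two ends of each queue are accessed concurrently, which is where the synchronization cost discussed in the abstract arises.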

Articles published on Fine-grained Parallelism

On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures

STORMM: Structure and topology replica molecular mechanics for chemical simulations.

AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs

Topo: Towards a fine-grained topological data processing framework on Tianhe-3 supercomputer

X-OpenMP — eXtreme fine-grained tasking using lock-less work stealing

Parallel interior-point solver for block-structured nonlinear programs on SIMD/GPU architectures

GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes

A parallel Canny edge detection algorithm based on OpenCL acceleration.

PipeSFL: A Fine-Grained Parallelization Framework for Split Federated Learning on Heterogeneous Clients

Flip: Data-centric Edge CGRA Accelerator

Fine-grained adaptive parallelism for automotive systems through AMALTHEA and OpenMP

Fast Parallel Algorithms for Enumeration of Simple, Temporal, and Hop-constrained Cycles

Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications

Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP

An efficient three-dimensional numerical simulation of particle acoustic agglomeration with fine-grained parallelization on graphical processing unit

A parallel particle swarm optimization algorithm based on GPU/CUDA

HipBone: A performance-portable graphics processing unit-accelerated C++ version of the NekBone benchmark

Code modernization strategies for short-range non-bonded molecular dynamics simulations

Parametric Optimization on HPC Clusters with Geneva

Parallel improved DPSA algorithm for medium-term optimal scheduling of large-scale cascade hydropower plants
