Inter-thread Communication Research Articles

Purpose: Structural topology optimization is computationally expensive because it involves high-resolution meshes and repeated finite element analysis (FEA) to compute the structural response. Since FEA consumes most of the computational time in each optimization iteration, a novel GPU-based parallel strategy for FEA is presented and applied to the large-scale structural topology optimization of 3D continuum structures.

Design/methodology/approach: A matrix-free solver based on the preconditioned conjugate gradient (PCG) method is proposed to minimize the computational time spent solving the linear system of equations in FEA. The proposed solver uses only the symmetric half of the elemental stiffness matrices to implement the element-by-element matrix-free solver on the GPU.

Findings: Using the solid isotropic material with penalization (SIMP) method, the proposed matrix-free solver is tested on three 3D structural optimization problems discretized with all-hexahedral structured and unstructured meshes. Results show that the proposed strategy achieves a 3.1×–3.3× speedup for the FEA solver stage and an overall speedup of 2.9×–3.3× over the standard element-by-element strategy on the GPU. Moreover, the proposed strategy requires almost 1.8× less GPU memory than the standard element-by-element strategy.

Originality/value: The proposed GPU-based matrix-free element-by-element solver takes a more general approach to the symmetry concept than previous works. It stores only the symmetric half of the elemental matrices in memory and performs matrix-free sparse matrix-vector multiplication (SpMV) without any inter-thread communication. A customized data storage format is also proposed to store and access only the symmetric half of the elemental stiffness matrices, giving coalesced read and write operations on the GPU for unstructured meshes.
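The idea of an element-by-element product that reads only the symmetric half of each elemental matrix can be illustrated with a small serial sketch. This is not the authors' GPU kernel or their storage format: the function names `pack_upper` and `ebe_symmetric_matvec`, the row-major packed layout, and the two-element 1D bar example are all assumptions chosen for illustration.

```python
# Serial sketch (assumed layout, not the paper's GPU implementation):
# compute y = K x element by element, storing only the upper triangle
# of each symmetric elemental stiffness matrix.

def pack_upper(ke):
    """Pack the upper triangle (diagonal included) of a symmetric matrix, row-major."""
    n = len(ke)
    return [ke[i][j] for i in range(n) for j in range(i, n)]


def ebe_symmetric_matvec(packed_elems, connectivity, x):
    """Return y = K x without assembling the global matrix K."""
    y = [0.0] * len(x)
    for packed, dofs in zip(packed_elems, connectivity):
        n = len(dofs)
        xe = [x[d] for d in dofs]      # gather the element's entries of x
        ye = [0.0] * n
        k = 0
        for i in range(n):
            for j in range(i, n):
                v = packed[k]
                k += 1
                ye[i] += v * xe[j]     # entry (i, j) as stored
                if j != i:
                    ye[j] += v * xe[i]  # mirrored entry (j, i), never stored
        for local, d in enumerate(dofs):
            y[d] += ye[local]          # scatter-add into the global vector
    return y


# Hypothetical example: two 1D bar elements sharing the middle node.
ke = [[1.0, -1.0], [-1.0, 1.0]]
packed_elems = [pack_upper(ke), pack_upper(ke)]
connectivity = [[0, 1], [1, 2]]
y = ebe_symmetric_matvec(packed_elems, connectivity, [1.0, 2.0, 4.0])
print(y)
```

For an n×n elemental matrix the packed form holds n(n+1)/2 values instead of n². Assuming 8-node hexahedral elements with 3 degrees of freedom per node (n = 24), that is 300 values instead of 576. The serial scatter-add in the last loop is where a parallel version has to avoid write conflicts; how the paper does this on the GPU is not described in the abstract.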


The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but the use of OpenMP with Hypre leads to at least a 2× slowdown due to OpenMP overheads. The proposed solution uses MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead, performs as fast as or up to 1.44× faster than Hypre's MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using CUDA-aware MPI, resulting in an overall speedup of 1.16× to 1.44× compared to the baseline GPU implementation.

The above optimization strategies were published in the International Conference on Computational Science 2020 [1]. This work extends the previously published research by carrying out a second phase of communication-centered optimizations in Hypre to improve its scalability on large-scale supercomputers. These include an efficient non-blocking inter-thread communication scheme, communication-reducing patch assignment, and expression of logical communication parallelism to a new version of the MPICH library that utilizes the underlying network parallelism [2]. These optimizations avoid communication bottlenecks previously observed during strong scaling and improve performance by up to 2× on 256 nodes of Intel Knights Landing processors.


Articles published on Inter-thread Communication

Accurate Data Race Prediction in the Linux Kernel through Sparse Fourier Learning

Managing Concurrent Queues for Efficient In-Vehicle Gateways

Acceleration of structural topology optimization using symmetric element-by-element strategy for unstructured meshes on GPU

Parallel Greedy Algorithm to Multiple Influence Maximization in Social Network

Optimizing the hypre solver for manycore and GPU architectures

An efficient multi-threaded memory allocator for PDES applications

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

FA-Stack: A Fast Array-Based Stack with Wait-Free Progress Guarantee

Hardware Multithreaded Transactions

GPU Implementation of Bitplane Coding with Parallel Coefficient Processing for High Performance Image Compression

Parallel Acoustic Field Simulation with Respect to Scattering of Sound on Local Inhomogeneities

Mostly-optimistic concurrency control for highly contended dynamic workloads on a thousand cores

Benchmarking weak memory models

Comparison of Data Partitioning Schema of Parallel Pairwise Alignment on Shared Memory System

CommGuard

Inter-thread communication efficiency

User satisfaction aware routing and energy modeling of polymorphic network on chip architecture

Real-world design and evaluation of compiler-managed GPU redundant multithreading

Multi-core implementation of the differential ant-stigmergy algorithm for numerical optimization
