Unified Memory Research Articles

Solving an N-body problem is a computationally demanding task in many scientific fields, ranging from astrophysics to biomolecular simulations. The direct solution scales as O(N^2); even on modern hardware, direct calculation becomes impractical for moderate numbers of particles, so efficient yet accurate approximations are key. For biomolecular simulations, one widely used method is Particle Mesh Ewald (PME), which scales as O(N log N). Although extremely fast on a single node, PME runs into communication bottlenecks when parallelized for large simulation systems on many nodes. The fast multipole method (FMM) offers an attractive alternative: it requires less communication and reduces the complexity to the optimal O(N). The method approximates long-range interactions by grouping particles into clusters represented as multipoles. The cluster size grows with the interaction distance according to the underlying octree structure, so more distant particles require fewer interaction computations and therefore less communication. Here, we present our full NVIDIA CUDA FMM implementation, which has been optimized for the electrostatic interactions described by Coulomb's law that are relevant to molecular dynamics simulations. We compare different parallelization approaches for the computationally limiting part of the algorithm, the Multipole-to-Local (M2L) operator, and discuss their performance bottlenecks. The first approach can be implemented with only minimal modifications to the sequential CPU implementation. It uses the Unified Memory concept, which allows simple reuse of the existing CPU data structures. The second approach improves performance by exploiting CUDA Dynamic Parallelism and yields a significant speedup, especially when high accuracy is required. The third approach abstracts the underlying octree with precomputed interaction lists and exploits operator symmetries to achieve nearly optimal performance over the whole tested accuracy range.
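The idea of replacing a distant cluster by a multipole can be illustrated with a minimal Python sketch. It is not taken from the article: the function names, the toy coordinates, the unit system (Coulomb constant set to 1), and the use of the geometric center as expansion point are all assumptions made for illustration. Only the lowest-order (monopole) term is shown, whereas an actual FMM uses higher-order expansions on an octree.

```python
import math

def direct_potential(targets, sources, charges):
    """Coulomb potential at each target by direct summation.

    Cost is O(len(targets) * len(sources)); units assume a Coulomb constant of 1.
    """
    result = []
    for t in targets:
        phi = 0.0
        for s, q in zip(sources, charges):
            phi += q / math.dist(t, s)
        result.append(phi)
    return result

def monopole_potential(targets, sources, charges):
    """Lowest-order multipole approximation (illustrative assumption):
    the whole cluster is replaced by its total charge at its geometric center.
    """
    total = sum(charges)
    n = len(sources)
    center = tuple(sum(s[k] for s in sources) / n for k in range(3))
    return [total / math.dist(t, center) for t in targets]

# Hypothetical toy data: a small cluster near the origin and two distant targets.
sources = [(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, -0.1, 0.0)]
charges = [1.0, 1.0, 1.0, 1.0]
targets = [(10.0, 0.0, 0.0), (0.0, 0.0, 20.0)]

exact = direct_potential(targets, sources, charges)
approx = monopole_potential(targets, sources, charges)
rel_err = [abs(a - e) / abs(e) for a, e in zip(approx, exact)]
print(rel_err)
```

For targets this far from the cluster, the single-term approximation already agrees closely with the direct sum, which is why well-separated clusters can be handled with one cluster-level interaction instead of one interaction per particle pair.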

This article presents a novel, scalable parallel computing framework for large‐scale and multiscale simulations of granular media. Key to the new framework is an innovative thread‐block‐wise representative volume element (RVE) parallelism, inspired by the resemblance between a typical multiscale computational hierarchy and the hierarchical thread structure of graphics processing units (GPUs). To solve a hierarchical multiscale problem, all computation in an RVE is assigned to a single block of threads, so that the RVE runs entirely on a GPU and avoids frequent data exchange with the host CPU. The thread blocks can meanwhile run asynchronously, which implicitly guarantees the independence of inter‐RVE computation that characterizes the hierarchical multiscale structure. The parallel computing algorithms are formulated and implemented in an in‐house code, GoDEM, using GPU‐specific techniques such as coalesced access, shared memory utilization, and unified memory. Benchmark and performance tests are conducted against an open‐source CPU‐based DEM code under three typical loading conditions. The performance of GoDEM is examined with varying thread‐block size, GPU register pressure, and RVE number. The tests reveal that increasing GPU occupancy by decreasing register pressure results in a significant degradation of performance rather than an improvement. We further demonstrate that the proposed GPU parallelism framework may achieve a saturated speedup of approximately 350 compared with the single‐CPU‐core code. To demonstrate its application to multiscale modeling of granular media, the material point method is coupled with the DEM powered by the new framework to simulate a typical engineering‐scale problem involving tens of millions of particles in total. The simulation shows that a speedup of approximately 91 can be achieved with the proposed framework, compared with the performance of a similar CPU program running on a cluster node with 44 parallel threads. The study offers a viable future solution to large‐scale and multiscale modeling of granular media.

Articles published on Unified Memory

Unified Holistic Memory Management Supporting Multiple Big Data Processing Frameworks over Hybrid Memories

Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training

Multi-GPU systems and Unified Virtual Memory for scientific applications: The case of the NAS multi-zone parallel benchmarks

Efficient ROS-Compliant CPU-iGPU Communication on Embedded Platforms

Accelerating On-Device Learning with Layer-Wise Processor Selection Method on Unified Memory

An annotation-free whole-slide training approach to pathological classification of lung cancer types using deep learning

Grus

A CUDA Fast Multipole Method with Highly Efficient M2L Far Field Evaluation

Performance of CUDA Unified Memory in CMS Heterogeneous Pixel Reconstruction

Second Kings 24–25 and Jeremiah 52 as Diverging and Converging Memories of the Babylonian Conquest

High throughput token driven FSM based regex pattern matching for network intrusion detection system

SA-JSTN: Self-Attention Joint Spatiotemporal Network for Temperature Forecasting

Optimizing Data Pipeline Performance in Modern GPU Architectures

IMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures

Efficient Buffer Overflow Detection on GPU

A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU

Jittor: a novel deep learning framework with meta-operators and unified graph execution

A CUDA fast multipole method with highly efficient M2L far field evaluation

A thread‐block‐wise computational framework for large‐scale hierarchical continuum‐discrete modeling of granular media

Enabling Latency-Aware Data Initialization for Integrated CPU/GPU Heterogeneous Platform
