Limited Memory Bandwidth Research Articles

Abstract. Ice-sheet flow models capable of accurately projecting their future mass balance constitute tools to improve flood risk assessment and assist sea-level rise mitigation associated with enhanced ice discharge. Some processes that need to be captured, such as grounding-line migration, require high spatial resolution (below the kilometer scale). Conventional ice flow models mainly execute on central processing units (CPUs), which feature limited parallel processing capabilities and peak memory bandwidth. This may hinder model scalability and result in long run times, requiring significant computational resources. As an alternative, graphics processing units (GPUs) are ideally suited for high spatial resolution, as the calculations can be performed concurrently by thousands of threads, processing most of the computational domain simultaneously. In this study, we combine a GPU-based approach with the pseudo-transient (PT) method, an accelerated iterative and matrix-free solution strategy, and investigate its performance for finite elements and unstructured meshes with application to two-dimensional (2-D) models of real glaciers at a regional scale. For both the Jakobshavn and Pine Island glacier models, the number of nonlinear PT iterations required to converge for a given number of vertices (N) scales in the order of 𝒪(N^1.2) or better. We further compare the performance of the PT CUDA C implementation with a standard finite-element CPU-based implementation using the price-to-performance metric. The price of a single Tesla V100 GPU is 1.5 times that of two Intel Xeon Gold 6140 CPUs. We therefore expect a speedup of at least 1.5 times to justify the Tesla V100 GPU in price-to-performance terms. Our developments result in a GPU-based implementation that achieves this goal with a speedup beyond 1.5 times. This study represents a first step toward leveraging GPU processing power, enabling more accurate polar ice discharge predictions.
The insights gained will benefit efforts to diminish spatial resolution constraints at higher computing performance. The higher computing performance will allow for ensembles of ice-sheet flow simulations to be run at the continental scale and higher resolution, a previously challenging task. The advances will further enable the quantification of model sensitivity to changes in upcoming climate forcings. These findings will significantly benefit process-oriented sea-level-projection studies over the coming decades.
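
The price-to-performance argument and the iteration scaling quoted in the abstract can be illustrated with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the price ratio of 1.5 and the exponent 1.2 are taken from the abstract, and the speedup value of 2.0 is an arbitrary example rather than a measured result.

```python
def relative_price_to_performance(speedup: float, price_ratio: float) -> float:
    """Speedup per unit of extra cost; 1.0 is the break-even point.

    speedup:     CPU run time divided by GPU run time (illustrative input).
    price_ratio: GPU price divided by CPU price (1.5 in the abstract).
    """
    return speedup / price_ratio


def is_justified(speedup: float, price_ratio: float = 1.5) -> bool:
    """True when the speedup at least matches the price ratio."""
    return relative_price_to_performance(speedup, price_ratio) >= 1.0


def iteration_growth(vertex_ratio: float, exponent: float = 1.2) -> float:
    """Factor by which the iteration count grows if it scales as N**exponent."""
    return vertex_ratio ** exponent


# Hypothetical example: a 2.0x speedup against the 1.5x price ratio.
print(is_justified(2.0))       # speedup exceeds the price ratio
print(is_justified(1.2))       # speedup falls short of the price ratio
print(iteration_growth(2.0))   # growth factor when the vertex count doubles
```

With an exponent of 1.2, doubling the number of vertices increases the iteration count by a factor of roughly 2.3, which is the upper bound implied by "𝒪(N^1.2) or better".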

Multiphysics applications often require the use of intimately coupled solvers. The application studied here makes use of an Eulerian solver to model fluid flow and combustion and a Lagrangian solver to model spray droplets. These are then implemented within one code to solve gas turbine combustion problems. However, large-scale simulations where the flow and spray are within the same computational process can be expensive, as the parallel solution does not scale well due to the poor load balancing of the spray particles. This is overcome by an asynchronous task-based Eulerian-Lagrangian (ATEL) approach where separate computational processes are used so that each solver can use an appropriate technique to partition the problem. Previously, this has been shown to overcome the load balancing problem but was restricted to a single computational node where shared memory could be used to transfer data. This work expands the methodology to work on large-scale HPC facilities using a combination of shared memory and high-speed interconnect to transfer data. The parallel methodology exploits one-sided shared memory communication when the corresponding processes are located within the same compute node; otherwise it falls back to a conventional send/receive pair. In addition, a hierarchical partitioning procedure is proposed that ensures that groups of parallel subdomains with high connectivity are placed on the same compute node. Results are shown for two combustor cases: the DLR generic single sector combustor, with an injection process that resembles a prefilming airblast atomiser of the kind found in many modern civil aircraft engines, and a bluff-body swirl burner with a single source of fuel injection resembling a pressure atomiser. Both single sector and three sector combustor configurations have been used to carry out the performance studies.
All performance cases have been tested with three different solver configurations: (a) the baseline Eulerian-Lagrangian solver, (b) ATEL, and (c) the baseline Eulerian solver without spray. The unstructured grids varied from 7M cells to 84M cells. In all cases the ATEL solution with flow, combustion and spray scaled identically to the solution of flow and combustion alone. In fact, because of the memory bandwidth limitations of multicore processors, moving some cores from the flow and combustion solver to the spray solver hardly affected the computational speed of the flow solution. Because the spray calculation overlaps with the flow calculation, the coupled Eulerian-Lagrangian solution could be achieved at almost no cost penalty relative to the Eulerian solution on its own. The choice of how to split the cores across the two solvers was considered by proposing a simple model to estimate the cost of each solver. Timing measurements show that, for the cases considered, the overall computational time is only weakly sensitive to this choice.
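
The communication rule described in this abstract (one-sided shared memory within a compute node, send/receive otherwise) can be sketched as a simple lookup. This is a minimal illustration, not the authors' implementation: the function name, the transport labels, and the rank-to-node mapping are all hypothetical, and no real message-passing library is called.

```python
from typing import Dict


def choose_transport(src_rank: int, dst_rank: int, node_of: Dict[int, str]) -> str:
    """Pick a transport label for a pair of processes.

    node_of maps each process rank to the name of the compute node that
    hosts it (a hypothetical mapping used only for this illustration).
    """
    if node_of[src_rank] == node_of[dst_rank]:
        # Both processes share a node, so shared memory is available.
        return "one_sided_shared_memory"
    # Processes on different nodes fall back to the interconnect.
    return "send_receive"


# Hypothetical placement: ranks 0 and 1 share a node, rank 2 is elsewhere.
node_of = {0: "node-a", 1: "node-a", 2: "node-b"}
print(choose_transport(0, 1, node_of))
print(choose_transport(0, 2, node_of))
```

Under this rule, a partitioning that places highly connected subdomains on the same node, as the abstract proposes, increases the share of exchanges that can use the shared-memory path.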

Articles published on Limited Memory Bandwidth

PIMCoSim: Hardware/Software Co-Simulator for Exploring Processing-in-Memory Architectures

Data Pruning-enabled High Performance and Reliable Graph Neural Network Training on ReRAM-based Processing-in-Memory Accelerators

A TabPFN-based intrusion detection system for the industrial internet of things

SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer

Strengthening IoT Network Protocols: A Model Resilient Against Cyber Attacks

Graphics-processing-unit-accelerated ice flow solver for unstructured meshes using the Shallow-Shelf Approximation (FastIceFlo v1.0.1)

A novel time-domain in-memory computing unit using STT-MRAM

Enabling memory access isolation in real-time cloud systems using Intel’s detection/regulation capabilities

16-Bit (4 × 4) Optical Random Access Memory (RAM) Bank

PISA-DMA: Processing-in-Memory Instruction Set Architecture Using DMA

CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems

A Survey of Network-Based Hardware Accelerators

Asynchronous task based Eulerian-Lagrangian parallel solver for combustion applications

DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads

Design of an energy-efficient binarized convolutional neural network accelerator using a nonvolatile field-programmable gate array with only-once-write shifting

Evaluation of Static Mapping for Dynamic Space-Shared Multi-task Processing on FPGAs

Skyrmion Logic-In-Memory Architecture for Maximum/Minimum Search

Development of compression algorithms for hyperspectral aerospace images based on discrete orthogonal transformations

Improving Network Slimming With Nonconvex Regularization

Optimizing Data Pipeline Performance in Modern GPU Architectures
