Stencil Computations Research Articles

Stencil computations form the basis for computer simulations across almost every field of science, such as computational fluid dynamics, data mining, and image processing. Their mostly regular data access patterns can let them exploit the high computation rate and data bandwidth of GPUs, but only if data buffering and other issues are handled properly. Finding a good code generation strategy presents a number of challenges, one of which is how best to use memory. GPUs have several types of on-chip storage, including registers, shared memory, and a read-only cache. Choosing the type of storage and how it is used (a buffering strategy) for each stencil array (grid function, GF) requires a good understanding not only of the GF's stencil pattern but also of how efficiently each type of storage serves that GF, so that storage that would benefit another GF more is not squandered. For a stencil computation with $N$ GFs, the total number of possible assignments is $\beta^{N}$, where $\beta$ is the number of buffering strategies. Our code-generation framework supports five buffering strategies ($\beta = 5$). Large, complex stencil kernels may consist of dozens of GFs, so an exhaustive search incurs significant overhead. In this work, we present an analytic performance model for stencil computations on GPUs and study the behavior of the read-only cache and the L2 cache. Next, we propose an efficiency-based assignment algorithm that scores a change in buffering strategy for a GF using a combination of (a) the predicted execution time and (b) on-chip storage usage. With this scoring, an assignment for $N$ GFs can be determined in $(\beta - 1)N(N+1)/2$ steps. Results show that the performance model has good accuracy and that the assignment strategy is highly efficient.
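To make the gap between the two counts concrete, the following Python sketch compares $\beta^{N}$ with $(\beta - 1)N(N+1)/2$ and runs a toy greedy loop whose number of score evaluations matches the second expression. The abstract does not spell out the assignment algorithm, so the loop structure (fix one GF per round after trying every alternative strategy on every still-open GF), the function names, and `toy_score` are all assumptions made for illustration, not the authors' method.

```python
def exhaustive_count(n_gfs, n_strategies):
    """Number of assignments an exhaustive search would visit: beta^N."""
    return n_strategies ** n_gfs


def greedy_step_bound(n_gfs, n_strategies):
    """Step count quoted in the abstract: (beta - 1) * N * (N + 1) / 2."""
    return (n_strategies - 1) * n_gfs * (n_gfs + 1) // 2


def greedy_assign(n_gfs, n_strategies, score):
    """Hypothetical greedy loop (an assumption, not the published algorithm).

    Strategy 0 is treated as the default. Each round tries every
    non-default strategy on every still-open GF, keeps the best-scoring
    change if it improves on the current score, and then closes that GF.
    """
    assignment = [0] * n_gfs
    open_gfs = set(range(n_gfs))
    evaluations = 0
    while open_gfs:
        best = None  # (score, gf, strategy)
        for gf in sorted(open_gfs):
            for s in range(1, n_strategies):
                trial = assignment.copy()
                trial[gf] = s
                evaluations += 1
                val = score(trial)
                if best is None or val > best[0]:
                    best = (val, gf, s)
        val, gf, s = best
        if val > score(assignment):
            assignment[gf] = s
        open_gfs.remove(gf)
    return assignment, evaluations


def toy_score(assignment):
    """Made-up score: a gain on even-indexed GFs minus a storage penalty."""
    gain = sum(s for i, s in enumerate(assignment) if i % 2 == 0)
    storage = sum(assignment)
    return gain - 0.5 * storage


n_gfs, n_strategies = 20, 5
assignment, evaluations = greedy_assign(n_gfs, n_strategies, toy_score)
print(exhaustive_count(n_gfs, n_strategies))   # 5**20 candidate assignments
print(greedy_step_bound(n_gfs, n_strategies))  # 840
print(evaluations)                             # equals the bound for this loop
```

For $N = 20$ and $\beta = 5$, exhaustive search would visit $5^{20} \approx 9.5 \times 10^{13}$ assignments, whereas the quadratic bound is 840 steps.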

We present a unified method for numerical evaluation of volume, surface, and path integrals of smooth, bounded functions on implicitly defined bounded domains. The method avoids both the stochastic nature (and slow convergence) of Monte Carlo methods and the problem-specific domain decompositions required by most traditional numerical integration techniques. Our approach operates on a uniform grid over an axis-aligned box containing the region of interest, so we refer to it as a grid-based method. All grid-based integrals are computed as a sum of contributions from a stencil computation on the grid points. Each class of integrals (path, surface, or volume) involves a different stencil formulation, but grid-based integrals of a given class can be evaluated by applying the same stencil on the same set of grid points; only the data on the grid points changes. When functions are defined over the continuous domain, so that grid refinement is possible, grid-based integration is supported by a convergence proof based on wavelet analysis. Because they rely only on function values on a uniform grid, grid-based integration methods apply directly to data produced by volumetric imaging (including computed tomography and magnetic resonance), direct numerical simulation of fluid flow, or any other method that produces values of a function sampled on a regular grid. Every step of a grid-based integral computation (evaluating a function on a grid, applying stencils on the grid, and reducing the contributions from the grid points to a single sum) is well suited for parallelization. We present results from a parallelized CUDA implementation of grid-based integrals that faithfully reproduces the output of a serial implementation with significant reductions in computing time. We also present example grid-based integral results to quantify convergence rates associated with grid refinement and the dependence of the convergence rate on the specific choice of difference stencil (corresponding to a particular genus of Daubechies wavelet).
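The basic idea of summing per-grid-point contributions over an axis-aligned box can be illustrated with a deliberately naive Python sketch. It is not the wavelet-derived stencil formulation described in the abstract: it simply adds up the integrand at cell centers where an implicit function `phi` is negative, which covers only the volume-integral case (here in 2D) and converges slowly. The function names and the convention that the domain is `phi < 0` are assumptions for illustration.

```python
import math


def grid_integral_2d(f, phi, lo, hi, n):
    """Naive grid-based integral of f over the region {phi < 0}.

    Uses an n-by-n uniform grid of cell centers on the box [lo, hi]^2 and
    sums f at the centers that fall inside the implicitly defined domain.
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        for j in range(n):
            y = lo + (j + 0.5) * h
            if phi(x, y) < 0.0:
                total += f(x, y)
    return total * h * h


def unit_disk(x, y):
    """Implicit function that is negative inside the unit disk."""
    return x * x + y * y - 1.0


# Area of the unit disk, which should approach pi as the grid is refined.
area = grid_integral_2d(lambda x, y: 1.0, unit_disk, -1.5, 1.5, 400)
print(area, math.pi)
```

Changing the integrand `f` while keeping `phi`, the box, and `n` fixed reuses the same set of grid points, which mirrors the abstract's remark that only the data on the grid points changes between integrals of the same class.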

Articles published on Stencil Computations

Optimization Approach to Accelerator Codesign

Performance Limits Study of Stencil Codes on Modern GPGPUs

Using Arm’s scalable vector extension on stencil codes

Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors

PACC: a directive-based programming framework for out-of-core stencil computation on accelerators

Multi-FPGA Accelerator Architecture for Stencil Computation Exploiting Spacial and Temporal Scalability

Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations

Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs

Parallelizable adjoint stencil computations using transposed forward-mode algorithmic differentiation

A Code Generator for Energy-Efficient Wavefront Parallelization of Uniform Dependence Computations

Thoroughly Exploring GPU Buffering Options for Stencil Code by Using an Efficiency Measure and a Performance Model

An Autotuning Protocol to Rapidly Build Autotuners

Unleashing the performance of ccNUMA multiprocessor architectures in heterogeneous stencil computations

Extreme-Scale High-Order WENO Simulations of 3-D Detonation Wave with 10 Million Cores

A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs

Optimization of Finite-Differencing Kernels for Numerical Relativity Applications

Reproducible stencil compiler benchmarks using prova!

Locally Recursive Non-Locally Asynchronous Algorithms for Stencil Computation

Treat All Integrals as Volume Integrals: A Unified, Parallel, Grid-Based Method for Evaluation of Volume, Surface, and Path Integrals on Implicitly Defined Domains.
