Multi-level spatial and temporal tiling for efficient HPC stencil computation on many-core processors with large shared caches

Charles Yount, Alejandro Duran, Josh Tobin

doi:10.1016/j.future.2017.10.041

https://doi.org/10.1016/j.future.2017.10.041

Abstract

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications, especially those arising from finite-difference numerical solutions to differential equations representing the behavior of physical phenomena such as seismic activity. The performance of stencil calculations is often bounded by memory bandwidth, and such code benefits from vectorization and tiling techniques to reuse data as much as possible once it is loaded from memory. These tiling algorithms are especially crucial for many-core CPU products that contain caches local to the individual cores, and this work provides a review of the use of techniques such as vector-folding and spatial tiling to maximize per-core cache resources. Recent many-core products also include special memory with much higher bandwidth than traditional DDR memory that is intended to provide additional performance for bandwidth-limited applications. On such platforms that also include DDR, the high-bandwidth RAM may be configurable either as separately addressable memory or as a large shared cache for the DDR. Examples of platforms with this feature include those containing products in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing), which use Multi-Channel DRAM (MCDRAM) technology to provide the higher bandwidth memory resources. In traditional sequential time-step stencil algorithms, the additional bandwidth can most easily be exploited when the stencil data fits into the faster memory, restricting the problem sizes that can be undertaken and under-utilizing the larger DDR memory on the platform. As stencil problem sizes become significantly larger than the fast-memory capacity, the sequential time-step algorithms create an overwhelming number of misses from the fast-memory shared cache, and the effective bandwidth approaches that of the DDR, significantly degrading performance.
This paper illustrates this effect and explores the application of temporal wave-front tiling to alleviate it, simultaneously leveraging both the large cache’s bandwidth and the DDR capacity. Two example applications are used to illustrate the optimizations: a single-grid isotropic approximation to the wave equation and a staggered-grid formulation for earthquake simulation. Details of the various tiling algorithms are given for both applications, and results on a Xeon Phi processor are presented, comparing performance across problem sizes and among four experimental configurations. Analyses of the bandwidth utilization and MCDRAM-cache hit rates are provided for one of the example applications, illustrating the correlation between these metrics and performance. It is demonstrated that temporal wave-front tiling can provide up to a 2.4x speedup compared to using the fast-memory cache without temporal tiling and 3.3x speedup compared to only using DDR memory for large problem sizes on the isotropic application. Respective speedups of 1.9x and 2.8x are demonstrated for the staggered-grid application.
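The contrast between sequential time-step sweeps and temporal wave-front tiling can be illustrated with a minimal one-dimensional sketch. This is not the paper's implementation, which targets 3D grids with vector folding and multi-level tiling. The 3-point averaging stencil, the fixed boundary cells, and the function names `step_range`, `sequential_sweeps`, and `wavefront_tiled` are all assumptions chosen for illustration. The sketch only demonstrates that the skewed tile ordering produces the same values as full-grid sweeps while advancing each tile through several time steps before moving on; it does not measure performance.

```python
def step_range(src, dst, lo, hi):
    """Apply one 3-point averaging update to indices [lo, hi)."""
    for i in range(lo, hi):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0


def sequential_sweeps(u0, steps):
    """Baseline: every time step sweeps the whole grid before the next begins."""
    n = len(u0)
    bufs = [list(u0), list(u0)]  # ping-pong buffers; cells 0 and n-1 stay fixed
    for t in range(steps):
        step_range(bufs[t % 2], bufs[(t + 1) % 2], 1, n - 1)
    return bufs[steps % 2]


def wavefront_tiled(u0, steps, tile, depth):
    """Temporal wave-front tiling (illustrative): each spatial tile is advanced
    through up to `depth` time steps while its data is still 'hot'. The tile
    shifts left by one cell (the stencil radius) per step so that every read
    sees values of the correct time level."""
    n = len(u0)
    bufs = [list(u0), list(u0)]
    t0 = 0
    while t0 < steps:
        d = min(depth, steps - t0)  # the last temporal block may be shallower
        # Tile starts run past the right edge so the left-shifted tiles
        # still cover the whole interior at every step of the block.
        for start in range(1, (n - 1) + (d - 1), tile):
            for dt in range(d):
                lo = max(1, start - dt)
                hi = min(n - 1, start + tile - dt)
                if lo < hi:
                    t = t0 + dt
                    step_range(bufs[t % 2], bufs[(t + 1) % 2], lo, hi)
        t0 += d
    return bufs[steps % 2]


# Small demonstration on an arbitrary initial grid.
grid = [float((i * 7) % 11) for i in range(64)]
print(wavefront_tiled(grid, 10, tile=8, depth=4) == sequential_sweeps(grid, 10))
```

In this sketch, `tile` stands in for a block of data sized to stay resident in the fast memory, and `depth` for the number of time steps applied per pass over that block. Larger `depth` means fewer passes over the full grid, which is the source of the reduced slow-memory traffic the abstract describes.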

Journal: Future Generation Computer Systems
Publication Date: Nov 13, 2017
Citations: 15
