Abstract

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications, especially those arising from finite-difference numerical solutions to differential equations representing the behavior of physical phenomena such as seismic activity. The performance of stencil calculations is often bounded by memory bandwidth, and such code benefits from vectorization and tiling techniques to reuse data as much as possible once it is loaded from memory. These tiling algorithms are especially crucial for many-core CPU products that contain caches local to the individual cores, and this work provides a review of the use of techniques such as vector-folding and spatial tiling to maximize the use of per-core cache resources. Recent many-core products also include special memory with much higher bandwidth than traditional DDR memory that is intended to provide additional performance for bandwidth-limited applications. On platforms that also include DDR, the high-bandwidth RAM may be configurable either as separately addressable memory or as a large shared cache for the DDR. Examples of platforms with this feature include those containing products in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing), which use Multi-Channel DRAM (MCDRAM) technology to provide the higher-bandwidth memory resources. In traditional sequential time-step stencil algorithms, the additional bandwidth can most easily be exploited when the stencil data fits into the faster memory, which restricts the problem sizes that can be undertaken and under-utilizes the larger DDR memory on the platform. As stencil problem sizes become significantly larger than the fast-memory capacity, the sequential time-step algorithms create an overwhelming number of misses from the fast-memory shared cache, and the effective bandwidth approaches that of the DDR, significantly degrading performance.
This paper illustrates this effect and explores the application of temporal wave-front tiling to alleviate it, simultaneously leveraging both the large cache’s bandwidth and the DDR capacity. Two example applications are used to illustrate the optimizations: a single-grid isotropic approximation to the wave equation and a staggered-grid formulation for earthquake simulation. Details of the various tiling algorithms are given for both applications, and results on a Xeon Phi processor are presented, comparing performance across problem sizes and among four experimental configurations. Analyses of the bandwidth utilization and MCDRAM-cache hit rates are provided for one of the example applications, illustrating the correlation between these metrics and performance. It is demonstrated that, for large problem sizes on the isotropic application, temporal wave-front tiling can provide up to a 2.4x speedup compared to using the fast-memory cache without temporal tiling and up to a 3.3x speedup compared to using only DDR memory. Respective speedups of 1.9x and 2.8x are demonstrated for the staggered-grid application.
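The difference between a sequential time-step sweep and a temporal wave-front sweep can be illustrated with a minimal one-dimensional sketch. This is not the paper's implementation: the 3-point averaging stencil, the fixed end-point boundaries, the two ping-pong buffers, and every name below (`step_range`, `sweep_sequential`, `sweep_wavefront`, `tile`, `wf_steps`) are assumptions chosen for illustration. The sketch only shows the iteration order. Each spatial tile is advanced through several time steps before the next tile is touched, with the tile shifted left by the stencil radius at each step so that all dependencies have already been computed.

```python
def step_range(src, dst, lo, hi):
    """Apply one 3-point averaging step to points lo..hi-1 (reads src, writes dst)."""
    for i in range(lo, hi):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0


def sweep_sequential(u0, num_steps):
    """Reference order: each time step sweeps the whole grid before the next begins."""
    bufs = [list(u0), list(u0)]  # ping-pong buffers; end points stay fixed
    n = len(u0)
    for t in range(num_steps):
        step_range(bufs[t % 2], bufs[(t + 1) % 2], 1, n - 1)
    return bufs[num_steps % 2]


def sweep_wavefront(u0, num_steps, tile=8, wf_steps=4):
    """Wave-front order: advance each tile through up to wf_steps time steps,
    shifting it left by the stencil radius at every step (hypothetical parameters)."""
    bufs = [list(u0), list(u0)]
    n = len(u0)
    lo, hi = 1, n - 1  # interior points; indices 0 and n-1 are boundaries
    radius = 1         # reach of the 3-point stencil
    for t0 in range(0, num_steps, wf_steps):
        steps = min(wf_steps, num_steps - t0)
        # Tile starts run past the right edge so the shifted tiles still cover the grid.
        for start in range(lo, hi + (steps - 1) * radius, tile):
            for dt in range(steps):
                shift = dt * radius
                a = max(lo, start - shift)         # clamp the shifted tile to the interior
                b = min(hi, start + tile - shift)
                if a < b:
                    t = t0 + dt
                    step_range(bufs[t % 2], bufs[(t + 1) % 2], a, b)
    return bufs[num_steps % 2]


u0 = [float((i * 7) % 11) for i in range(50)]
same = sweep_wavefront(u0, 10) == sweep_sequential(u0, 10)
```

Both orderings evaluate the same expression on the same inputs at every point, so they should produce identical results; only the order in which memory is visited changes. In a compiled implementation, the working set of one wave-front pass (roughly `tile + wf_steps * radius` points per buffer in this sketch) is what would need to stay resident in the fast memory.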
