Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures

Kaushik Datta∗†, Mark Murphy†, Vasily Volkov†, Samuel Williams∗†, Jonathan Carter∗, Leonid Oliker∗†, David Patterson∗†, John Shalf∗, and Katherine Yelick∗†

CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA

Abstract

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.

1. Introduction

The computing industry has recently moved away from exponential scaling of clock frequency toward chip multiprocessors (CMPs) in order to better manage trade-offs among performance, energy efficiency, and reliability [1]. Because this design approach is relatively immature, there is a vast diversity of available CMP architectures.
System designers and programmers are confronted with a confusing variety of architectural features, such as multicore, SIMD, simultaneous multithreading, core heterogeneity, and unconventional memory hierarchies, often combined in novel arrangements. Given the current flux in CMP design, it is unclear which architectural philosophy is best suited for a given class of algorithms. Likewise, this architectural diversity leads to uncertainty on how to refactor existing algorithms and tune them to take maximum advantage of existing and emerging platforms. Understanding the most efficient design and utilization of these increasingly parallel multicore systems is one of the most challenging questions faced by the computing industry since it began.

This work presents a comprehensive set of multicore optimizations for stencil (nearest-neighbor) computations, a class of algorithms at the heart of most calculations involving structured (rectangular) grids, including both implicit and explicit partial differential equation (PDE) solvers. Our work explores the relatively simple 3D heat equation, which can be used as a proxy for more complex stencil calculations. In addition to their importance in scientific calculations, stencils are interesting as an architectural evaluation benchmark because they have abundant parallelism and low computational intensity, offering a mixture of opportunities for on-chip parallelism and challenges for the associated memory systems.

Our optimizations include NUMA affinity, array padding, core/register blocking, prefetching, and SIMDization, as well as novel stencil algorithmic transformations that leverage multicore resources: thread blocking and circular queues.
Since there are complex and unpredictable interactions between our optimizations and the underlying architectures, we develop an auto-tuning environment for stencil codes that searches over a set of optimizations and their parameters to minimize runtime and provide performance portability across the breadth of existing and future architectures. We believe such application-specific auto-tuners are the most practical near-term approach for obtaining high performance on multicore systems.

To evaluate the effectiveness of our optimization strategies, we explore the broadest set of multicore architectures in the current HPC literature, including the out-of-order cache-based microprocessor designs of the dual-socket×quad-core AMD Barcelona and the dual-socket×quad-core Intel Clovertown, the heterogeneous local-store-based architecture of the dual-socket×eight-core fast double precision STI Cell QS22 PowerXCell 8i Blade, as well as one of the first scientific studies of the hardware-multithreaded dual-socket×eight-core×eight-thread Sun Victoria Falls machine. Additionally, we present results on the single-socket×240-core multithreaded streaming NVIDIA GeForce GTX280 general-purpose graphics processing unit (GPGPU).

This suite of architectures allows us to compare the mainstream multicore approach of replicating conventional cores that emphasize serial performance (Barcelona and
