Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures

Kaushik Datta∗†, Mark Murphy†, Vasily Volkov†, Samuel Williams∗†, Jonathan Carter∗, Leonid Oliker∗†, David Patterson∗†, John Shalf∗, and Katherine Yelick∗†

CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA

Abstract

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.

1. Introduction

The computing industry has recently moved away from exponential scaling of clock frequency toward chip multiprocessors (CMPs) in order to better manage trade-offs among performance, energy efficiency, and reliability [1]. Because this design approach is relatively immature, there is a vast diversity of available CMP architectures.
System designers and programmers are confronted with a confusing variety of architectural features, such as multicore, SIMD, simultaneous multithreading, core heterogeneity, and unconventional memory hierarchies, often combined in novel arrangements. Given the current flux in CMP design, it is unclear which architectural philosophy is best suited for a given class of algorithms. Likewise, this architectural diversity leads to uncertainty on how to refactor existing algorithms and tune them to take maximum advantage of existing and emerging platforms. Understanding the most efficient design and utilization of these increasingly parallel multicore systems is one of the most challenging questions faced by the computing industry since it began.

This work presents a comprehensive set of multicore optimizations for stencil (nearest-neighbor) computations, a class of algorithms at the heart of most calculations involving structured (rectangular) grids, including both implicit and explicit partial differential equation (PDE) solvers. Our work explores the relatively simple 3D heat equation, which can be used as a proxy for more complex stencil calculations. In addition to their importance in scientific calculations, stencils are interesting as an architectural evaluation benchmark because they have abundant parallelism and low computational intensity, offering a mixture of opportunities for on-chip parallelism and challenges for the associated memory systems.

Our optimizations include NUMA affinity, array padding, core/register blocking, prefetching, and SIMDization, as well as novel stencil algorithmic transformations that leverage multicore resources: thread blocking and circular queues.
Since there are complex and unpredictable interactions between our optimizations and the underlying architectures, we develop an auto-tuning environment for stencil codes that searches over a set of optimizations and their parameters to minimize runtime and provide performance portability across the breadth of existing and future architectures. We believe such application-specific auto-tuners are the most practical near-term approach for obtaining high performance on multicore systems.

To evaluate the effectiveness of our optimization strategies, we explore the broadest set of multicore architectures in the current HPC literature, including the out-of-order cache-based microprocessor designs of the dual-socket×quad-core AMD Barcelona and the dual-socket×quad-core Intel Clovertown, the heterogeneous local-store-based architecture of the dual-socket×eight-core fast double precision STI Cell QS22 PowerXCell 8i Blade, as well as one of the first scientific studies of the hardware-multithreaded dual-socket×eight-core×eight-thread Sun Victoria Falls machine. Additionally, we present results on the single-socket×240-core multithreaded streaming NVIDIA GeForce GTX280 general-purpose graphics processing unit (GPGPU).

This suite of architectures allows us to compare the mainstream multicore approach of replicating conventional cores that emphasize serial performance (Barcelona and
