Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system

Charlene Yang, Samuel Williams, Thorsten Kurth

doi:10.1002/cpe.5547

Abstract

The Roofline performance model provides an intuitive and insightful approach to identifying performance bottlenecks and guiding performance optimization. In preparation for the next‐generation supercomputer Perlmutter at NERSC, this paper presents a methodology to construct a hierarchical Roofline on NVIDIA GPUs and extends it to support reduced precision and Tensor Cores. The hierarchical Roofline incorporates L1, L2, device memory, and system memory bandwidths into one single figure, and it offers more profound insights into performance analysis than the traditional DRAM‐only Roofline. We use our Roofline methodology to analyze three proxy applications: GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow. In doing so, we demonstrate the ability of our methodology to readily understand various aspects of performance and performance bottlenecks on NVIDIA GPUs and motivate code optimizations.
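
To make the idea of a hierarchical Roofline concrete, here is a minimal Python sketch of the standard Roofline bound, min(peak compute, arithmetic intensity × bandwidth), evaluated once per memory level. It is not the authors' tooling: the function names, the peak, the bandwidths, and the byte counts are all hypothetical placeholders chosen for easy arithmetic, not measurements from any GPU or from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Standard Roofline bound for one memory level.

    Arithmetic intensity (AI) is FLOPs per byte moved at that level;
    the bound is the lower of the compute peak and AI * bandwidth.
    """
    arithmetic_intensity = flops / bytes_moved
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def hierarchical_roofline(peak_gflops, bandwidths_gbs, flops, bytes_per_level):
    """Evaluate the Roofline bound separately for every memory level."""
    return {
        level: attainable_gflops(peak_gflops, bw, flops, bytes_per_level[level])
        for level, bw in bandwidths_gbs.items()
    }


# Hypothetical machine and kernel numbers (assumptions, not real data).
PEAK_GFLOPS = 1000.0
BANDWIDTHS_GBS = {"L1": 4000.0, "L2": 2000.0, "device": 800.0, "system": 50.0}
KERNEL_FLOPS = 1.0e9
KERNEL_BYTES = {"L1": 4.0e9, "L2": 2.0e9, "device": 1.0e9, "system": 1.0e7}

bounds = hierarchical_roofline(
    PEAK_GFLOPS, BANDWIDTHS_GBS, KERNEL_FLOPS, KERNEL_BYTES
)
# The level with the lowest bound is the one limiting this kernel.
limiting_level = min(bounds, key=bounds.get)

for level, gflops in bounds.items():
    print(f"{level:>6}: {gflops:.1f} GFLOP/s")
print("limiting level:", limiting_level)
```

With these placeholder inputs, only the device-memory level falls below the compute peak, so it is the limiting level. A DRAM-only Roofline would report just that one bound; the per-level view also shows how much headroom each cache level has.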


Journal: Concurrency and computation : practice & experience
Publication Date: Nov 12, 2019
Citations: 43
License type: publisher-specific, author manuscript

Similar Papers

An Instruction Roofline Model for GPUs
Nan Ding ... Samuel Williams
01 Nov 2019

Instruction Roofline: An insightful visual performance model for GPUs
Nan Ding ... Muaaz Awan
Concurrency and computation : practice & experience | VOL. 34
01 Sep 2021

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance
Hiroyuki Ootomo ... Rio Yokota
The International Journal of High Performance Computing Applications | VOL. 36
03 Jun 2022

Solving DWF Dirac equation using multi-splitting preconditioned conjugate gradient with tensor cores on NVIDIA GPUs
Jiqun Tu ... Chulwoo Jung
05 Jul 2021
