Tensor-optimized hardware accelerates fused discontinuous Galerkin simulations

Alexander Heinecke, Alexander Breuer, Yifeng Cui

doi:10.1016/j.parco.2019.102550

Abstract

In recent years, the computation/memory balance of processors has been continuously shifting towards computation. The rise of Deep Learning, which is based on matrix multiplications, accelerated this trend, especially in terms of single precision and lower precision computation. An important research question is whether this development can be leveraged for traditional HPC. In this work, we demonstrate that a high-order discontinuous Galerkin solver for seismic wave propagation can execute in single precision without loss of modeling accuracy. Additionally, we extended its kernels to support the Intel Knights Mill CPU with 14 TFLOPS of tensor-optimized single precision performance. This allows us to exploit the hardware’s special computation capabilities, even in a regular HPC application with sparse linear algebra kernels. At the cluster level, Knights Mill can obtain the same application performance as the latest top-bin dual-socket Intel Xeon Platinum nodes. Compared to the HPC-focused Knights Landing processor, speed-ups of up to 1.6× are possible, depending on the configuration of the solver. Additionally, we are able to increase the throughput of a quadrature-free discontinuous Galerkin method for seismic simulations by 4.2× when comparing our solver’s single precision, fifth-order performance to the SC 2017 best-paper-award-winning work.

Full Text