Abstract

Graphics processing units (GPUs) offer a promising architecture for implementing highly parallel solution methods for systems of ordinary differential equations (ODEs). However, their high performance comes with caveats such as small caches and wide SIMD units. For ODE methods, optimizing the memory access pattern is often crucial. In this article, instead of considering only one specific method, we generalize the description of explicit ODE methods by using data flow graphs consisting of basic operations that cover the types of computations occurring in all common explicit methods. After showing that the straightforward approach of processing the data flow graph by calling one kernel per basic operation is memory bound, we explain how the number of memory accesses can be reduced by kernel fusion, which fuses several basic operations into one kernel. Moreover, we present enabling transformations that allow additional fusions and thus can reduce the number of memory accesses even further. We apply these optimizations to three different classes of explicit ODE methods: embedded Runge–Kutta (RK) methods, parallel iterated RK (PIRK) methods, and peer methods. A detailed experimental evaluation on three modern GPUs showed speedups between 1.86 and 3.51 compared to unfused implementations.
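To illustrate the idea of kernel fusion described above, the following minimal CUDA sketch contrasts an unfused implementation, which launches one kernel per basic operation and passes the intermediate vector through global memory, with a fused kernel that keeps the intermediate value in a register. The two vector operations shown are generic stage-update-style computations chosen for illustration; they are not the specific basic operations, data flow graphs, or kernels used in the article.

```
#include <cuda_runtime.h>

// Unfused: each basic operation is a separate kernel, so the intermediate
// vector tmp is written to and later re-read from global memory.
__global__ void basic_op1(const double* y, const double* f,
                          double h, double a, double* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = y[i] + h * a * f[i];       // first basic operation
}

__global__ void basic_op2(const double* y, const double* tmp,
                          double h, double b, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = y[i] + h * b * tmp[i];     // second basic operation
}

// Fused: both basic operations in one kernel; the intermediate value stays
// in a register, saving one global write and one global read per element.
__global__ void fused_op(const double* y, const double* f,
                         double h, double a, double b, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double t = y[i] + h * a * f[i];
        out[i]   = y[i] + h * b * t;
    }
}

int main() {
    const int n = 1 << 20;                         // illustrative problem size
    double *y, *f, *tmp, *out;
    cudaMalloc(&y,   n * sizeof(double));
    cudaMalloc(&f,   n * sizeof(double));
    cudaMalloc(&tmp, n * sizeof(double));
    cudaMalloc(&out, n * sizeof(double));

    dim3 block(256), grid((n + 255) / 256);
    // Unfused variant: two launches, intermediate goes through global memory.
    basic_op1<<<grid, block>>>(y, f, 0.01, 0.5, tmp, n);
    basic_op2<<<grid, block>>>(y, tmp, 0.01, 1.0, out, n);
    // Fused variant: one launch, no global-memory intermediate.
    fused_op<<<grid, block>>>(y, f, 0.01, 0.5, 1.0, out, n);
    cudaDeviceSynchronize();

    cudaFree(y); cudaFree(f); cudaFree(tmp); cudaFree(out);
    return 0;
}
```

In the unfused variant each element requires an extra store and load of tmp, which is why such implementations tend to be memory bound; the fused variant removes that traffic at the cost of building a larger, more specialized kernel.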
