Abstract

The performance of most scientific applications depends on efficient mathematical libraries. For example, scientific applications like the plane wave based Density Functional Theory approach for electronic structure calculations uses highly optimized libraries for Fourier transforms, dense linear algebra (orthogonalization) and sparse linear algebra (non-local projectors in real space). Although vendor-tuned libraries offer efficient implementations for each standalone mathematical kernel, the partitioning of those calls into sequentially invoked kernels inhibits cross-kernel optimizations that could improve data locality across memory bound operations. In this work we show that, by expressing these kernels as an operation on high dimensional tensors, cross-kernel dataflow optimizations that span FFT, dense and sparse linear algebra, can be readily exposed and exploited. We outline a systematic way of merging the Fourier transforms with the linear algebra computations, improving data locality and reducing data movement to main memory. We show that compared to conventional implementations, this streaming/dataflow approach offers 2x speedup on GPUs and 8x/12x speedup on CPUs compared to a baseline code that uses vendor-optimized libraries. Although we use Density Functional Theory to demonstrate the value of our approach, our methodology is broadly applicable to other applications that use Fourier transforms and linear algebra operations as building blocks.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call