Abstract

Energy consumption is becoming a critical issue in HPC, and there is broad consensus that future exascale computing will be strongly constrained by it. Heterogeneous systems usually offer higher energy efficiency than homogeneous ones because they employ coprocessors that deliver more GFlops/Watt than CPUs. It is therefore important to utilize the coprocessors better from an energy-efficiency standpoint. Dense LU factorization (LU) is a critical kernel that is widely used to solve dense linear algebra problems. However, existing heterogeneous implementations are typically CPU-centered: they rely heavily on the CPUs and thus suffer from large data-transfer overheads over PCIe, which hurts the energy efficiency of the entire computer system. We present a coprocessor-resident implementation of LU for a heterogeneous platform that improves energy efficiency without impeding performance, by relieving the CPUs of unnecessary computation and reducing excessive data transfers over PCIe. In addition, several optimizations are employed to overlap computation with communication between the CPUs and coprocessors. Validation on the Tianhe-2 supercomputer shows that our LU implementation achieves higher performance, higher energy efficiency, and better scalability than Intel MKL.

