Abstract

In this paper we present a detailed study of implementing double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon Phi coprocessor. We discuss a DGEMM implementation that runs entirely on the coprocessor, minimizing communication with the host CPU. We run DGEMM across a range of matrix sizes, both natively and using the Intel Math Kernel Library. Our optimizations were designed to maximize reuse of the on-die cache, which significantly reduces transfers from GDDR memory. Finally, we analyze the improvement of a classic matrix multiplication implementation based on the Cauchy algorithm relative to the latest results achieved with the Intel Math Kernel Library DGEMM subroutine.
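To illustrate the two approaches being compared, the following is a minimal sketch (not the paper's code) of a classic triple-loop DGEMM, C = alpha*A*B + beta*C, alongside the equivalent call to the Intel Math Kernel Library routine cblas_dgemm. The matrix size, storage layout, and initialization values are hypothetical and chosen only for illustration.

```c
#include <stdio.h>
#include <mkl.h>   /* provides cblas_dgemm, mkl_malloc, mkl_free */

/* Classic "Cauchy-style" matrix multiplication: three nested loops
 * over n x n row-major matrices, C = alpha*A*B + beta*C. */
static void dgemm_classic(int n, double alpha, const double *A,
                          const double *B, double beta, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
}

int main(void)
{
    const int n = 1024;                 /* hypothetical matrix size */
    const double alpha = 1.0, beta = 0.0;

    /* 64-byte aligned allocations, as commonly recommended for Xeon Phi. */
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    if (!A || !B || !C) return 1;

    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* Classic triple-loop version. */
    dgemm_classic(n, alpha, A, B, beta, C);

    /* Equivalent optimized MKL DGEMM call (row-major, no transposition). */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, A, n, B, n, beta, C, n);

    printf("C[0] = %f\n", C[0]);

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}
```

Timing these two paths over a range of matrix sizes, natively on the coprocessor, is the kind of comparison the abstract describes; the MKL path benefits from cache blocking and vectorization that the naive loop lacks.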
