An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512

Roktaek Lim,Yeongha Lee,Jaeyoung Choi,Raehyun Kim

doi:10.1007/s10586-018-2810-y

Abstract

The second generation Intel Xeon Phi processor codenamed Knights Landing (KNL) have recently emerged with 2D tile mesh architecture and the Intel AVX-512 instructions. However, it is very difficult for general users to get the maximum performance from the new architecture since they are not familiar with optimal cache reuse, efficient vectorization, and assembly language. In this paper, we illustrate several developing strategies to achieve good performance with C programming language by carrying out general matrix–matrix multiplications and without the use of assembly language. Our implementation of matrix–matrix multiplication is based on blocked matrix multiplication as an optimization technique that improves data reuse. We use data prefetching, loop unrolling, and the Intel AVX-512 to optimize the blocked matrix multiplications. When we use a single core of the KNL, our implementation achieves up to 98% of SGEMM and 99% of DGEMM using the Intel MKL, which is the current state-of-the-art library. Our implementation of the parallel DGEMM using all 68 cores of the KNL achieves up to 90% of DGEMM using the Intel MKL.

Full Text