Abstract

The matrix–vector multiplication kernel is one of the most important and common computational operations, forming the core of many scientific and engineering applications; it is therefore essential to optimize and accelerate its implementation. This paper proposes an optimized algorithm for single-precision matrix–vector multiplication (SGEMV) on the Intel Core i7 processor. The use of Intel's Advanced Vector Extensions (AVX) instructions to implement dense matrix–vector multiplication kernels in parallel is comprehensively addressed, and a variety of performance optimization techniques combining the AVX instruction sets, memory access optimization, and OpenMP parallelization are designed. The performance of the proposed algorithms is evaluated against the latest version of the Intel Math Kernel Library 2017 SGEMV subroutines, which apply the same optimization methods considered in this paper. In contrast to previous works, which usually concentrate on a single technique and the performance it achieves, this paper gives an overview of the optimization techniques, explains the specific details of handling each of them in the proposed algorithm, and shows the advantages and challenges of combining them. Guidelines for the parallel implementation of the proposed algorithm, and the characteristics of the target architecture that should be considered when implementing it, are investigated. A comparative study of two of the most popular C++ compilers, the Intel C++ Compiler 17.0 in Intel Parallel Studio XE 2017 and the Microsoft Visual Studio C++ compiler 2015, is also presented.
Finally, the two primary ways of utilizing AVX instructions, inline assembly and intrinsic functions, are compared, as are single-core and multi-core platforms. The results are evaluated on a 2.6 GHz Intel Core i7-5600U (Broadwell) processor with 128 KB L1 cache, 512 KB L2 cache, and 4 MB L3 cache, running the Windows 10 operating system. The proposed optimized algorithm is evaluated on square matrices of large sizes ranging from 1024 to 19456. The results indicate performance improvements of 18.2% for y = A·x and 14.1% for y = A^T·x compared with the results obtained using the latest version of the Intel Math Kernel Library 2017 SGEMV subroutines on a multi-core platform.
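To make the combination of techniques concrete, the following is a minimal sketch, not the paper's actual algorithm, of an SGEMV kernel (y = A·x) that pairs 8-wide single-precision AVX intrinsics with an OpenMP parallel loop over rows. The function names `dot_avx` and `sgemv_avx` are illustrative; for brevity the sketch assumes the row length is a multiple of 8, whereas a real kernel would also handle the remainder and consider blocking for the cache hierarchy.

```c
#include <immintrin.h>
#include <stddef.h>

/* Dot product of one matrix row with x, 8 floats at a time.
   The target attribute lets the compiler emit AVX code for this
   function even if the rest of the file is built without -mavx. */
__attribute__((target("avx")))
static float dot_avx(const float *row, const float *x, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t j = 0; j < n; j += 8) {
        __m256 a = _mm256_loadu_ps(row + j);  /* 8 floats of the row */
        __m256 v = _mm256_loadu_ps(x + j);    /* 8 floats of x */
        acc = _mm256_add_ps(acc, _mm256_mul_ps(a, v));
    }
    /* Horizontal sum of the 8 partial sums. */
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3]
         + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}

/* y = A * x for an m-by-n row-major matrix A. Each row's dot
   product is independent, so rows are distributed across cores. */
void sgemv_avx(const float *A, const float *x, float *y,
               size_t m, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)m; ++i)
        y[i] = dot_avx(A + (size_t)i * n, x, n);
}
```

The transposed case y = A^T·x accesses A column-wise, so a vectorized implementation instead accumulates scaled rows into a vector of partial results, which changes the memory access pattern and is one reason the paper reports different speedups for the two variants.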
