Performance Evaluation of Matrix-Matrix Multiplications Using Intel's Advanced Vector Extensions (AVX)

Somaia Awad Hassan,A.M Hemeida,Mountasser M.M Mahmoud

doi:10.1016/j.micpro.2016.10.002

Abstract

Intel's Advanced Vector Extensions is known as single instruction multiple data streams (SIMD), and the instruction sets is introduced in the second-generation Intel Core processor family. This new technology is supported by new generations of Intel and AMD processors. The advanced vector extensions (AVX) exploits single instruction multiple data (SIMD) computing units for fine grained-parallelism. These instructions process multiple data elements simultaneously and independently. Many applications such as signal processing, recognition, visual processing, scientific and engineering numerical, physics and other areas of applications need for vector floating point performance supported by AVX. Matrix-Matrix multiplications is the core of many important algorithms such as signal processing, scientific and engineering numerical, so it is substantial to accelerate implementation of matrix-matrix multiplications. It is very important to use appropriate compilers that can optimally utilize the new features of the evolving processors. For this purpose, a clear vision on the performance of the compilers on performance characteristics of AVX is needed. In addition choosing the appropriate programming method is substantial to gain the best performance. In this paper, the performance evaluation of matrix-matrix multiplications in three forms (C=A. B, C=A. BT, and C=AT. B), using Intel's advanced vector extension (AVX) instruction sets has been reported. The obtained results are compared using inline assembly versus intrinsic functions for programming. A comparative study to indicate the effects of two widely used C++ compilers: Intel C++ compiler (ICC) in Intel Parallel Studio XE 2016 against Microsoft Visual Studio C++ compiler 2015 (MSVC++) has been investigated. The results are evaluated on Intel Core i7 processor on a Broadwell system for square matrices of different large sizes. The results demonstrate that the Intel compiler has better performance than MSVC++ compiler by 1.34, 1.32, and 1.22 using inline assembly language and by 1.36, 1.19, and 1.25 using intrinsic functions for C=A. B, C=A. BT, and C=AT. B, respectively. The performance of using intrinsic functions compared to the inline assembly demonstrates that the intrinsic functions has better performance than inline assembly by 2.1, 2.13, and 2.18 using Intel compiler and by 2.08, 2.49, and 2.11 using MSVC++ compiler for C=A. B, C=A. BT, and C=AT. B, respectively.

Full Text