Abstract

This paper presents an effective implementation of Strassen's algorithm for matrix-matrix multiplication on shared-memory multi-core architectures. The proposed implementation aims to increase computation speed while keeping power consumption as low as possible; measured in GFLOPS, it is on average 4.5 and 4.1 times faster than Eigen and OpenBLAS, respectively. It relies on AVX512 intrinsics, loop unrolling, and OpenMP directives. A new 2D blocking data allocation pattern is proposed for Strassen's algorithm to improve temporal and spatial cache locality. The proposed implementation reduces not only the amount of main memory used but also the burden of unnecessary memory allocation/deallocation and data transfer at each level of recursion in Strassen's algorithm. Moreover, it consumes, on average, 4.25 and 3.67 times less energy than the multiplication functions of the Eigen and OpenBLAS libraries, respectively. To measure computational performance with awareness of power consumption, GFLOPS per Watt (GFPW) is calculated; the GFPW of the proposed implementation is on average 3.78 and 3.47 times higher than those of the Eigen and OpenBLAS libraries, respectively.
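
For readers unfamiliar with the recursion the abstract refers to, the following is a minimal textbook sketch of Strassen's seven-product scheme in plain Python. It is not the implementation proposed here: it uses no AVX512, OpenMP, or 2D blocking, and it allocates fresh temporaries at every recursion level, which is the overhead the proposed implementation is designed to avoid. The function names (`strassen`, `naive`, `split`) and the `cutoff` parameter are illustrative assumptions, and the sketch assumes square matrices whose size is a power of two.

```python
def add(A, B):
    # Element-wise sum of two equally sized matrices.
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    # Element-wise difference of two equally sized matrices.
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def naive(A, B):
    # Classical triple-loop product, used as the recursion base case.
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def split(M):
    # Split a square matrix into four equal quadrants (copies, not views).
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B, cutoff=2):
    # Assumes A and B are n x n with n a power of two.
    n = len(A)
    if n <= cutoff:
        return naive(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of the classical eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    # Recombine the products into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Each level of this sketch creates new quadrant copies and sum/difference matrices, so memory traffic grows with recursion depth; this is the allocation and data-transfer burden the abstract describes reducing.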
