Compiling for the IBM Matrix Engine for Enterprise Workloads

Joao P L De Carvalho, Jose Nelson Amaral, Jose E Moreira

doi:10.1109/mm.2022.3176529

Abstract

The matrix-multiply assist (MMA) facility is the latest addition to IBM’s Power instruction set architecture (ISA) and first shipped in the recently introduced POWER10 processor. MMA is designed to accelerate matrix–matrix operations, such as matrix multiplication and convolution, using instructions that compute the outer product of vector-register operands. Linear algebra libraries have used outer product computations for decades to deliver high-performance implementations of matrix operations, relying on conventional single-instruction–multiple-data (SIMD) instructions to emulate the outer product. MMA in POWER10 is the first hardware released to the market with direct support for outer product operations. MMA also supports a wider range of data types than any accelerator design announced to date. Unleashing the high performance enabled by MMA requires careful code generation. Two key considerations for optimal MMA code performance are 1) the choice of accumulation layout to maximize the use of the accumulators and 2) the selection of matrix access order. This article shows that over 92% of peak performance in POWER10 with MMA can be achieved when the code generation makes the right choices.
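To make the outer product formulation mentioned in the abstract concrete, the following is a minimal sketch in plain Python of matrix multiplication written as a sum of rank-1 (outer product) updates into an accumulator. It is only an illustration of the arithmetic, not MMA code and not the article's code generation strategy; the function name outer_product_matmul and the variable acc are hypothetical names chosen for this example.

```python
def outer_product_matmul(a, b):
    """Compute C = A x B as a sum of outer products.

    a is an m-by-k matrix and b is a k-by-n matrix, both given as
    lists of rows. Step k adds the outer product of column k of A
    and row k of B to the accumulator.
    """
    m = len(a)
    depth = len(b)
    n = len(b[0])
    # Accumulator tile: zeroed once, then updated in place.
    # (Hypothetical stand-in for a hardware accumulator.)
    acc = [[0.0] * n for _ in range(m)]
    for k in range(depth):
        col = [a[i][k] for i in range(m)]  # column k of A
        row = b[k]                         # row k of B
        # Rank-1 update: acc += col (outer product) row
        for i in range(m):
            for j in range(n):
                acc[i][j] += col[i] * row[j]
    return acc


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = outer_product_matmul(A, B)
```

In this formulation every element of the accumulator is updated at each step, which is why how the accumulators are laid out and the order in which the matrices are traversed matter for performance.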

Full Text