Time and energy modeling of high-performance Level-3 BLAS on x86 architectures

Francisco D. Igual, Sandra Catalán, Pedro Alonso, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí, Rafael Mayo

doi:10.1016/j.simpat.2015.04.003

Abstract

We present accurate piece-wise models for the time and energy costs of high-performance implementations of both the matrix multiplication (gemm) and the triangular system solve with multiple right-hand sides (trsm) on x86 architectures. Our methodology decouples the costs due to the floating-point arithmetic/data movement occurring in the higher levels of the cache hierarchy from those of packing/data transfers between the main memory and the L2/L3 cache. A careful analytical study of the data transfers, combined with an architecture-specific calibration of the costs per operation, then yields the components needed to assemble piece-wise models that accurately estimate the performance of gemm and trsm on x86 processors. Our experimental results on an Intel Xeon E5-2620 processor confirm the accuracy of this approach: across different shapes of gemm and trsm, the relative errors are, respectively, around 1.5% and 4.5% on average for both time and energy.
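As a rough illustration of the decoupling idea described above, the following minimal sketch estimates gemm time as the sum of an arithmetic term and a data-transfer term, with the per-flop cost chosen piece-wise by problem dimension. It is not the authors' model: the breakpoints, the cost constants, the simplified traffic count, and the names `K_BREAKS`, `SEC_PER_FLOP`, `SEC_PER_BYTE`, `gemm_time_estimate`, and `relative_error` are all hypothetical placeholders; real values would come from a calibration on the target processor.

```python
from bisect import bisect_right

# Hypothetical calibration values, for illustration only (not from the paper).
K_BREAKS = [64, 256]                          # breakpoints on the k dimension
SEC_PER_FLOP = [4.0e-10, 2.0e-10, 1.0e-10]    # one per-flop cost per piece
SEC_PER_BYTE = 5.0e-10                        # assumed memory <-> cache transfer cost
BYTES_PER_ELEM = 8                            # double precision


def gemm_time_estimate(m, n, k):
    """Estimate the time of C += A*B, with A of size m-by-k and B of size k-by-n."""
    # Select the piece of the model that covers this value of k.
    piece = bisect_right(K_BREAKS, k)
    # Arithmetic term: 2*m*n*k flops at the calibrated cost for that piece.
    arithmetic = 2.0 * m * n * k * SEC_PER_FLOP[piece]
    # Transfer term (simplifying assumption): A and B read once, C read and written once.
    moved_bytes = BYTES_PER_ELEM * (m * k + k * n + 2 * m * n)
    return arithmetic + moved_bytes * SEC_PER_BYTE


def relative_error(estimated, measured):
    """Relative error of an estimate against a measured value."""
    return abs(estimated - measured) / measured
```

An energy estimate could follow the same structure with per-flop and per-byte energy costs in place of the time costs, and `relative_error` is the kind of metric against which such a model would be validated.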

Full Text