Abstract

As increases in clock frequency approach their physical limits, a promising way to enhance performance is to increase parallelism by integrating more cores as coprocessors alongside general-purpose processors to handle the diverse workloads of scientific and signal-processing applications. Many kernels in these applications map naturally onto data-parallel architectures such as array processors. The Basic Linear Algebra Subprograms (BLAS) are the standard operations for solving linear algebra problems efficiently on high-performance and parallel systems. In this paper, we implement and evaluate the performance of several important BLAS operations on a matrix coprocessor. Our analytical model shows that the performance of Level-3 BLAS, represented by the n×n matrix multiply-add operation, approaches the theoretical peak as n increases because its degree of data reuse is high. In contrast, the performance of Level-1 and Level-2 BLAS operations is low as a result of limited data reuse. Fortunately, many applications rely predominantly on Level-3 BLAS, with only a small percentage of Level-1 and Level-2 BLAS.
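The data-reuse argument above can be made concrete by counting flops per memory word (arithmetic intensity) for a representative kernel at each BLAS level. The following sketch uses idealized operand counts with no cache model; the function names and word counts are illustrative assumptions, not taken from the paper.

```python
# Idealized arithmetic intensity (flops per memory word) for one
# representative kernel per BLAS level, assuming every operand is
# moved to/from memory exactly once (no caching model).

def axpy_intensity(n):
    # Level-1 AXPY: y = a*x + y
    # 2n flops; reads x and y, writes y -> 3n words
    return (2 * n) / (3 * n)

def gemv_intensity(n):
    # Level-2 GEMV: y = A*x + y
    # 2n^2 flops; reads A (n^2 words), x, y, writes y -> n^2 + 3n words
    return (2 * n**2) / (n**2 + 3 * n)

def gemm_intensity(n):
    # Level-3 n x n multiply-add: C = C + A*B
    # 2n^3 flops; reads A, B, C and writes C -> 4n^2 words
    return (2 * n**3) / (4 * n**2)

if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(n, axpy_intensity(n), gemv_intensity(n), gemm_intensity(n))
```

Under this model the Level-1 and Level-2 intensities are bounded by small constants (2/3 and 2, respectively), while the Level-3 intensity grows as n/2, which is why only Level-3 performance can approach the coprocessor's theoretical peak as n increases.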
