Abstract

During the past five years the widespread availability of tuned kernels for matrix-matrix multiplication has dramatically narrowed the focus of parallel algorithm research in linear algebra. Underlying this change is the fact that an efficient subroutine can exploit the processor's superscalar design and memory hierarchy to compute a matrix-matrix multiplication faster than a subroutine can sequentially compute the component matrix-vector multiplications. Indeed, studies have shown that substantial gains in performance can be realized by redesigning linear algebra algorithms to increase the percentage of operations performed as matrix-matrix multiplication (Bischof et al., 1994; Dongarra et al., 1989; Gallivan et al., 1988; Schreiber and Van Loan, 1989). This is evidenced on the SGI POWER Challenge, where LAPACK reports an efficiency of 268 Mflops when multiplying two 1000 × 1000 matrices, but only 41 Mflops when multiplying a 1000 × 1000 matrix and a 1000-element vector (Anderson et al., 1995). A potential six-fold increase in performance is strong impetus for developing algorithms whose computations can be expressed in terms of matrix-matrix multiplication instead of matrix-vector multiplication. Solution procedures whose component computations cannot be cast in terms of matrix-matrix multiplication are no longer the focus of much research.
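The equivalence underlying this performance argument can be sketched in plain Python (an illustrative example, not code from the paper): a matrix-matrix product C = A·B is mathematically identical to applying A to each column of B as a sequence of matrix-vector products, so the speedup comes entirely from how a tuned kernel schedules the same arithmetic for cache reuse, not from doing different arithmetic.

```python
# Sketch (hypothetical helper names): C = A.B computed two ways.
# A tuned BLAS kernel blocks the triple loop for cache reuse; the
# column-by-column version performs the same flops with less reuse.

def matvec(A, x):
    """y = A x for a dense matrix A (list of rows) and vector x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def matmat_via_matvec(A, B):
    """Compute A.B as n separate matrix-vector products, one per column of B."""
    ncols = len(B[0])
    cols = [matvec(A, [row[j] for row in B]) for j in range(ncols)]
    return [list(r) for r in zip(*cols)]  # reassemble columns into rows

def matmat(A, B):
    """Direct triple-loop A.B; this is the loop nest tuned kernels block."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmat_via_matvec(A, B) == matmat(A, B)  # same result, different schedule
```

Both routines perform the same 2n³ floating-point operations; the cited Mflops gap on the SGI POWER Challenge reflects only the memory-access pattern.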
