Abstract

The QR factorization is one of the most important operations in dense linear algebra, offering a numerically stable method for solving linear systems of equations, including overdetermined and underdetermined systems. Modern implementations of the QR factorization, such as the one in the LAPACK library, suffer from performance limitations due to the use of matrix–vector (level 2 BLAS) operations in the panel factorization phase. These limitations can be remedied by using the idea of updating the QR factorization, yielding an algorithm that is far more scalable and better suited to implementation on a multi-core processor. It is demonstrated how the potential of the Cell Broadband Engine can be utilized to the fullest by employing this new algorithmic approach and successfully exploiting the capabilities of the chip in terms of single-instruction-multiple-data (SIMD) parallelism, instruction-level parallelism, and thread-level parallelism.
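
For concreteness, the following is a minimal sketch of the loop structure of the updating (tile) QR factorization, assuming the matrix is stored as an nt × nt grid of 64 × 64 tiles. The kernel names (sgeqrt, slarfb, stsqrt, sssrfb) follow the tile QR literature and are placeholders here, not necessarily the paper's exact interfaces.

```c
/* Sketch of tile QR ("updating" QR factorization) over an nt x nt
 * grid of tiles. The four kernels are declared as placeholders:
 * their names follow the tile QR literature, not this paper's
 * exact interfaces. */
typedef float tile_t[64 * 64];   /* one 64x64 single-precision tile */

void sgeqrt(tile_t Akk, tile_t Tkk);             /* QR of a diagonal tile        */
void slarfb(tile_t Akn, const tile_t Akk,
            const tile_t Tkk);                   /* apply its Q^T to tiles right */
void stsqrt(tile_t Akk, tile_t Amk, tile_t Tmk); /* QR of the pair [R; A(m,k)]   */
void sssrfb(tile_t Akn, tile_t Amn,
            const tile_t Amk, const tile_t Tmk); /* coupled update of tile pair  */

void tile_qr(int nt, tile_t *A, tile_t *T)  /* A, T: column-major tile grids */
{
#define TL(X, m, n) (X[(size_t)(n) * nt + (m)])
    for (int k = 0; k < nt; k++) {
        sgeqrt(TL(A, k, k), TL(T, k, k));
        for (int n = k + 1; n < nt; n++)
            slarfb(TL(A, k, n), TL(A, k, k), TL(T, k, k));
        for (int m = k + 1; m < nt; m++) {
            stsqrt(TL(A, k, k), TL(A, m, k), TL(T, m, k));
            for (int n = k + 1; n < nt; n++)
                sssrfb(TL(A, k, n), TL(A, m, n), TL(A, m, k), TL(T, m, k));
        }
    }
#undef TL
}
```

Because every kernel touches only one or two tiles, the iterations form a directed acyclic graph of small tasks that can be scheduled independently, which is what makes the approach scale, unlike a blocked panel factorization that serializes on matrix–vector operations.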

Highlights

  • State-of-the-art numerical linear algebra software utilizes block algorithms in order to exploit the memory hierarchy of traditional cache-based systems [1,2,3,4].

  • The results are checked for correctness by comparing the R factor produced by the algorithm to the R factor produced by a call to the LAPACK routine SGEQRF run on the Power Processing Element (PPE), as shown in the first sketch after this list.

  • It should be mentioned that the implementation utilizes Block Data Layout (BDL) [32,33], where each tile is stored in a contiguous 16 kB portion of main memory that can be moved in a single Direct Memory Access (DMA) transfer, which puts an equal load on all 16 memory banks (see the addressing sketch after this list).
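
The check in the second highlight amounts to a few lines of code. The sketch below is an illustration, not the paper's test harness: it assumes a square column-major matrix and calls SGEQRF through the LAPACKE interface, whereas the paper calls the routine on the PPE. Comparing absolute values sidesteps the sign ambiguity of the R factor, which is unique only up to the signs of its rows.

```c
/* Illustrative check of a computed R factor against LAPACK's SGEQRF.
 * A is the original n x n column-major matrix, R_tile the R factor
 * produced by the tile algorithm (only its upper triangle is read). */
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

float check_r_factor(int n, const float *A, const float *R_tile)
{
    float *ref = malloc((size_t)n * n * sizeof *ref);
    float *tau = malloc((size_t)n * sizeof *tau);
    memcpy(ref, A, (size_t)n * n * sizeof *ref);

    /* Reference factorization: R lands in the upper triangle of ref. */
    LAPACKE_sgeqrf(LAPACK_COL_MAJOR, n, n, ref, n, tau);

    float maxdiff = 0.0f, norm = 0.0f;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            norm = fmaxf(norm, fabsf(A[i + j * n]));
            if (i <= j)   /* R is upper triangular */
                maxdiff = fmaxf(maxdiff,
                    fabsf(fabsf(R_tile[i + j * n]) - fabsf(ref[i + j * n])));
        }
    free(ref);
    free(tau);
    return maxdiff / norm;   /* a small multiple of machine epsilon is expected */
}
```

The addressing arithmetic behind BDL is straightforward. The sketch below assumes 64 × 64 single-precision tiles (64 × 64 × 4 B = 16 kB, which is also the maximum size of a single Cell BE DMA transfer) laid out contiguously in column-major tile order; the names are illustrative.

```c
/* Block Data Layout addressing, assuming 64x64 float tiles stored
 * tile after tile; each tile is one contiguous 16 kB block, so a
 * whole tile moves in a single DMA. In practice the buffer would
 * also be 128-byte aligned for efficient DMA on the Cell BE. */
#include <stddef.h>

enum { NB = 64 };   /* tile dimension: NB * NB * sizeof(float) = 16 kB */

/* Start of tile (m, n) in a matrix stored as an mt x nt grid of
 * tiles in column-major tile order. */
static inline float *tile_addr(float *A, int mt, int m, int n)
{
    return A + ((size_t)n * mt + m) * NB * NB;
}

/* Element (i, j) within a tile, itself stored in column-major order. */
static inline float *elem_addr(float *tile, int i, int j)
{
    return tile + (size_t)j * NB + i;
}
```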

Summary

Introduction

Numerical linear algebra software utilizes block algorithms in order to exploit the memory hierarchy of traditional cache-based systems [1,2,3,4]. Public domain libraries such as LAPACK [5] and ScaLAPACK [6] are good examples. These implementations work on square or rectangular submatrices in their inner loops, where operations are encapsulated in calls to Basic Linear Algebra Subroutines (BLAS) [7], with emphasis on expressing the computation as level 3 BLAS (matrix–matrix type) operations. This article focuses exclusively on aspects of the efficient implementation of the algorithm and makes no attempt to discuss the numerical quality of the results related to the use of single precision with truncation rounding and the lack of support for NaNs and denormals (which is how the Cell BE implements single precision floating point arithmetic).
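
As a minimal illustration of what expressing the computation as level 3 BLAS means in code, the trailing-matrix update below performs C ← C − A·B with a single SGEMM call through the standard CBLAS interface; the function name and argument shapes are chosen for the example.

```c
/* A trailing-matrix update C <- C - A * B cast as one level 3 BLAS
 * call. A single SGEMM performs O(m*n*k) flops against only
 * O(m*k + k*n + m*n) words of memory traffic -- the ratio that lets
 * blocked algorithms hide memory latency. */
#include <cblas.h>

void trailing_update(int m, int n, int k,
                     const float *A, int lda,   /* m x k block  */
                     const float *B, int ldb,   /* k x n block  */
                     float *C, int ldc)         /* m x n block  */
{
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, -1.0f, A, lda, B, ldb, 1.0f, C, ldc);
}
```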

Related work
Algorithm
Implementation
Cell BE architecture overview
SIMD vectorization
Parallelization – single Cell BE
Parallelization – Dual Cell BE
Results
Conclusions
Future work