Abstract

The QR factorization is one of the most important operations in dense linear algebra, offering a numerically stable method for solving linear systems of equations, including overdetermined and underdetermined systems. Modern implementations of the QR factorization, such as the one in the LAPACK library, suffer from performance limitations due to the use of matrix–vector (level 2 BLAS) operations in the panel factorization phase. These limitations can be remedied by using the idea of updating the QR factorization, yielding an algorithm that is far more scalable and better suited to implementation on a multi-core processor. It is demonstrated how the potential of the Cell Broadband Engine can be utilized to the fullest by employing this new algorithmic approach and successfully exploiting the capabilities of the chip in terms of single-instruction-multiple-data (SIMD) parallelism, instruction-level parallelism, and thread-level parallelism.
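
For concreteness, the following is a minimal sketch of the loop structure of the updating (tile) QR factorization, assuming the matrix is stored as an nt × nt grid of 64 × 64 tiles. The kernel names (sgeqrt, slarfb, stsqrt, sssrfb) follow the tile QR literature and are placeholders here, not necessarily the paper's exact interfaces.

```c
/* Sketch of tile QR ("updating" QR factorization) over an nt x nt
 * grid of tiles. The four kernels are declared as placeholders:
 * their names follow the tile QR literature, not this paper's
 * exact interfaces. */
typedef float tile_t[64 * 64];   /* one 64x64 single-precision tile */

void sgeqrt(tile_t Akk, tile_t Tkk);             /* QR of a diagonal tile        */
void slarfb(tile_t Akn, const tile_t Akk,
            const tile_t Tkk);                   /* apply its Q^T to tiles right */
void stsqrt(tile_t Akk, tile_t Amk, tile_t Tmk); /* QR of the pair [R; A(m,k)]   */
void sssrfb(tile_t Akn, tile_t Amn,
            const tile_t Amk, const tile_t Tmk); /* coupled update of tile pair  */

void tile_qr(int nt, tile_t *A, tile_t *T)  /* A, T: column-major tile grids */
{
#define TL(X, m, n) (X[(size_t)(n) * nt + (m)])
    for (int k = 0; k < nt; k++) {
        sgeqrt(TL(A, k, k), TL(T, k, k));
        for (int n = k + 1; n < nt; n++)
            slarfb(TL(A, k, n), TL(A, k, k), TL(T, k, k));
        for (int m = k + 1; m < nt; m++) {
            stsqrt(TL(A, k, k), TL(A, m, k), TL(T, m, k));
            for (int n = k + 1; n < nt; n++)
                sssrfb(TL(A, k, n), TL(A, m, n), TL(A, m, k), TL(T, m, k));
        }
    }
#undef TL
}
```

Because every kernel touches only one or two tiles, the iterations form a directed acyclic graph of small tasks that can be scheduled independently, which is what makes the approach scale, unlike a blocked panel factorization that serializes on matrix–vector operations.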

Highlights

  • State-of-the-art numerical linear algebra software utilizes block algorithms in order to exploit the memory hierarchy of traditional cache-based systems [1,2,3,4].

  • The results are checked for correctness by comparing the R factor produced by the algorithm to the R factor produced by a call to the LAPACK routine SGEQRF run on the Power Processing Element (PPE), as shown in the first sketch after this list.

  • It should be mentioned that the implementation utilizes Block Data Layout (BDL) [32,33], where each tile is stored in a contiguous 16 kB portion of main memory that can be moved in a single Direct Memory Access (DMA) transfer, which puts an equal load on all 16 memory banks (see the addressing sketch after this list).
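
The check in the second highlight amounts to a few lines of code. The sketch below is an illustration, not the paper's test harness: it assumes a square column-major matrix and calls SGEQRF through the LAPACKE interface, whereas the paper calls the routine on the PPE. Comparing absolute values sidesteps the sign ambiguity of the R factor, which is unique only up to the signs of its rows.

```c
/* Illustrative check of a computed R factor against LAPACK's SGEQRF.
 * A is the original n x n column-major matrix, R_tile the R factor
 * produced by the tile algorithm (only its upper triangle is read). */
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

float check_r_factor(int n, const float *A, const float *R_tile)
{
    float *ref = malloc((size_t)n * n * sizeof *ref);
    float *tau = malloc((size_t)n * sizeof *tau);
    memcpy(ref, A, (size_t)n * n * sizeof *ref);

    /* Reference factorization: R lands in the upper triangle of ref. */
    LAPACKE_sgeqrf(LAPACK_COL_MAJOR, n, n, ref, n, tau);

    float maxdiff = 0.0f, norm = 0.0f;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            norm = fmaxf(norm, fabsf(A[i + j * n]));
            if (i <= j)   /* R is upper triangular */
                maxdiff = fmaxf(maxdiff,
                    fabsf(fabsf(R_tile[i + j * n]) - fabsf(ref[i + j * n])));
        }
    free(ref);
    free(tau);
    return maxdiff / norm;   /* a small multiple of machine epsilon is expected */
}
```

The addressing arithmetic behind BDL is straightforward. The sketch below assumes 64 × 64 single-precision tiles (64 × 64 × 4 B = 16 kB, which is also the maximum size of a single Cell BE DMA transfer) laid out contiguously in column-major tile order; the names are illustrative.

```c
/* Block Data Layout addressing, assuming 64x64 float tiles stored
 * tile after tile; each tile is one contiguous 16 kB block, so a
 * whole tile moves in a single DMA. In practice the buffer would
 * also be 128-byte aligned for efficient DMA on the Cell BE. */
#include <stddef.h>

enum { NB = 64 };   /* tile dimension: NB * NB * sizeof(float) = 16 kB */

/* Start of tile (m, n) in a matrix stored as an mt x nt grid of
 * tiles in column-major tile order. */
static inline float *tile_addr(float *A, int mt, int m, int n)
{
    return A + ((size_t)n * mt + m) * NB * NB;
}

/* Element (i, j) within a tile, itself stored in column-major order. */
static inline float *elem_addr(float *tile, int i, int j)
{
    return tile + (size_t)j * NB + i;
}
```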

Summary

Introduction

Numerical linear algebra software utilizes block algorithms in order to exploit the memory hierarchy of traditional cache-based systems [1,2,3,4]. Public domain libraries such as LAPACK [5] and ScaLAPACK [6] are good examples. These implementations work on square or rectangular submatrices in their inner loops, where operations are encapsulated in calls to Basic Linear Algebra Subroutines (BLAS) [7], with emphasis on expressing the computation as level 3 BLAS (matrix–matrix type) operations. This article focuses exclusively on aspects of the efficient implementation of the algorithm and makes no attempt to discuss the numerical quality of the results related to the use of single precision with truncation rounding and the lack of support for NaNs and denormals (which is how the Cell BE implements single precision floating point arithmetic).
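
As a minimal illustration of what expressing the computation as level 3 BLAS means in code, the trailing-matrix update below performs C ← C − A·B with a single SGEMM call through the standard CBLAS interface; the function name and argument shapes are chosen for the example.

```c
/* A trailing-matrix update C <- C - A * B cast as one level 3 BLAS
 * call. A single SGEMM performs O(m*n*k) flops against only
 * O(m*k + k*n + m*n) words of memory traffic -- the ratio that lets
 * blocked algorithms hide memory latency. */
#include <cblas.h>

void trailing_update(int m, int n, int k,
                     const float *A, int lda,   /* m x k block  */
                     const float *B, int ldb,   /* k x n block  */
                     float *C, int ldc)         /* m x n block  */
{
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, -1.0f, A, lda, B, ldb, 1.0f, C, ldc);
}
```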

Related work
Algorithm
Implementation
Cell BE architecture overview
SIMD vectorization
Parallelization – single Cell BE
Parallelization – Dual Cell BE
Results
Conclusions
Future work