Block-Cyclic Dense Linear Algebra

Woody Lichtenstein, S. Lennart Johnsson

doi:10.1137/0914075

Abstract

Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic-order elimination, together with a consecutive data allocation, yields good load balance for both the factorization and solution phases of solving dense systems of equations by LU and QR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS (basic linear algebra subroutines) are ideal for operations local to a node. High-rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update. This paper shows that in many parallel implementations, the $O(N^2)$ work in the factorization may be of the same significance as the $O(N^3)$ work, even for large matrices. The $O(N^2)$ work is poorly load balanced in two-dimensional nodal arrays, which are shown to b...
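The load-balance claim can be illustrated with a small Python sketch. It is not the paper's implementation: the function names, the one-dimensional (rows only) view, and the requirement that the row count divide evenly into blocks per node are all assumptions made for illustration. Rows are allocated consecutively, so node p owns one contiguous slab, and the elimination order visits one block per node in turn.

```python
def block_cyclic_order(n, nodes, block):
    """Order in which n consecutively allocated rows are eliminated.

    Hypothetical helper: takes one block of `block` rows from each node
    in turn, then the next block from each node, and so on.
    """
    per_node = n // nodes                      # rows owned by each node
    assert per_node * nodes == n and per_node % block == 0
    order = []
    for local_block in range(per_node // block):
        for node in range(nodes):
            start = node * per_node + local_block * block
            order.extend(range(start, start + block))
    return order


def active_rows_per_node(order, step, n, nodes):
    """Count the rows each node still has to update after `step` pivots."""
    per_node = n // nodes
    counts = [0] * nodes
    for row in order[step:]:
        counts[row // per_node] += 1           # consecutive allocation: owner = row // per_node
    return counts


cyclic = block_cyclic_order(16, nodes=4, block=2)
natural = list(range(16))

# Halfway through the factorization:
print(active_rows_per_node(cyclic, 8, 16, 4))   # [2, 2, 2, 2]
print(active_rows_per_node(natural, 8, 16, 4))  # [0, 0, 4, 4]
```

With the natural order, half of the nodes are idle once half of the rows have been eliminated. With the cyclic order every node keeps the same number of active rows, which is the effect that a block-cyclic data allocation would otherwise provide.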

Highlights

The $O(N^2)$ work is poorly load-balanced in two-dimensional nodal arrays, which we show are optimal with respect to communication for consecutive data allocation, block-cyclic order elimination, and a simple, but fairly general, communications model.
The main contributions of this paper are: 1) empirical evidence that a block-cyclic order elimination can be used effectively on distributed memory architectures to achieve load balance as an alternative to block-cyclic data allocation, 2) a discussion of the issues that arise when the block-cyclic orderings of rows and columns are different, which is the typical case when the number of processing nodes is not a square, and 3) a proof that within a wide class of regular data layouts, two-dimensional nodal arrays with consecutive data allocation and cyclic elimination order are optimal for elimination-based dense linear algebra routines.
The peak performance for the global factorization routines is about two thirds of the peak performance of the local level-2 BLAS routines used for the $O(N^3)$ work in the factorization.
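The arithmetic behind blocking can be sketched in plain Python. This is an illustrative example under stated assumptions, not code from the paper: `rank1_update` and `rank_k_update` are hypothetical names, and integer data is used so that the two results compare exactly. A rank-k update produces the same matrix as k successive rank-1 updates, but it sweeps over the local matrix once instead of k times, which is the usual reason higher-rank local operations run closer to peak.

```python
import copy


def rank1_update(a, x, y):
    """In place: a[i][j] -= x[i] * y[j] (one full sweep over a)."""
    for i, row in enumerate(a):
        for j in range(len(row)):
            row[j] -= x[i] * y[j]


def rank_k_update(a, xs, ys):
    """In place: a -= sum over k of outer(xs[k], ys[k]), in a single sweep over a."""
    for i, row in enumerate(a):
        for j in range(len(row)):
            row[j] -= sum(x[i] * y[j] for x, y in zip(xs, ys))


a = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
xs = [[1, 2, 3], [0, 1, 0]]   # two column vectors
ys = [[1, 0, 2], [3, 1, 1]]   # two row vectors

one_at_a_time = copy.deepcopy(a)
for x, y in zip(xs, ys):
    rank1_update(one_at_a_time, x, y)   # two sweeps over the matrix

blocked = copy.deepcopy(a)
rank_k_update(blocked, xs, ys)          # one sweep over the matrix

print(one_at_a_time == blocked)         # True
```

The sketch shows only that the results agree. The performance ratio between the two forms depends on the machine and is not modeled here.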

Summary

Introduction

The main contributions of this paper are: 1) empirical evidence that a block-cyclic order elimination can be used effectively on distributed memory architectures to achieve load balance as an alternative to block-cyclic data allocation, 2) a discussion of the issues that arise when the block-cyclic orderings of rows and columns are different, which is the typical case when the number of processing nodes is not a square, and 3) a proof that within a wide class of regular data layouts, two-dimensional nodal arrays with consecutive (block) data allocation and cyclic elimination order are optimal for elimination-based dense linear algebra routines. The effectiveness of the block-cyclic order elimination demonstrates the utility of
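The second contribution, differing row and column orderings, can be made concrete with a short Python sketch. It assumes, for illustration only, that each axis is cycled independently over its own node count with a block size of one. `cyclic_order` and `pivot_positions` are hypothetical names, and the paper's actual scheme may differ in detail.

```python
def cyclic_order(n, nodes):
    """Cyclic elimination order for n consecutively allocated indices on `nodes` nodes."""
    per_node = n // nodes
    assert per_node * nodes == n
    return [node * per_node + local
            for local in range(per_node)
            for node in range(nodes)]


def pivot_positions(n, node_rows, node_cols):
    """(row, column) of the k-th pivot when rows and columns are cycled independently."""
    return list(zip(cyclic_order(n, node_rows), cyclic_order(n, node_cols)))


# 2 x 3 array of nodes (6 nodes, not a square): the pivots leave the diagonal.
print(pivot_positions(12, 2, 3)[:4])   # [(0, 0), (6, 4), (1, 8), (7, 1)]

# 2 x 2 array of nodes: row and column orders coincide, so pivots stay on the diagonal.
print(all(r == c for r, c in pivot_positions(12, 2, 2)))   # True
```

Under this assumption, a non-square node array visits rows and columns in different permutations, so a solve routine has to track both orderings rather than a single one.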

Blocking for improved performance of local BLAS

Standard data layouts

Cyclic order factorization and triangular system solution

Balanced work load

Rectangular arrays of processing nodes

QR factorization and system solution

Performance

Measurements

Communication

Arithmetic efficiency

Detailed performance analysis

Scalability

Optimal layouts

Findings

Summary

Journal: SIAM Journal on Scientific Computing	Publication Date: Nov 1, 1993
Citations: 51	License type: cc-by
