Abstract

Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic-order elimination, together with a consecutive data allocation, yields good load balance for both the factorization and solution phases for the solution of dense systems of equations by LU and QR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS (Basic Linear Algebra Subprograms) are ideal for operations local to a node. High-rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update. This paper shows that in many parallel implementations, the $O(N^2)$ work in the factorization may be of the same significance as the $O(N^3)$ work, even for large matrices. The $O(N^2)$ work is poorly load balanced in two-dimensional nodal arrays, which are shown to b...
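The load-balance claim in the abstract can be illustrated with a small one-dimensional sketch. It is not the paper's implementation: the function names are hypothetical, only columns are modeled, and the number of nodes `p` is assumed to divide the number of columns `n`. Each node owns one contiguous chunk of columns (consecutive allocation), and the sketch compares eliminating columns in natural order with eliminating them in a cyclic order that takes one column from each node's chunk in turn.

```python
def consecutive_owner(col, n, p):
    """Node that owns column `col` when n columns are split into p contiguous chunks."""
    return col // (n // p)


def natural_order(n):
    """Eliminate columns left to right: 0, 1, 2, ..."""
    return list(range(n))


def cyclic_order(n, p):
    """Eliminate one column from each node's chunk in turn (assumes p divides n)."""
    chunk = n // p
    return [node * chunk + j for j in range(chunk) for node in range(p)]


def max_imbalance(order, n, p):
    """Largest spread (max - min) across nodes in the number of
    not-yet-eliminated columns, taken over the whole elimination."""
    remaining = [n // p] * p
    worst = 0
    for col in order:
        remaining[consecutive_owner(col, n, p)] -= 1
        worst = max(worst, max(remaining) - min(remaining))
    return worst


if __name__ == "__main__":
    n, p = 16, 4
    print("natural order:", max_imbalance(natural_order(n), n, p))
    print("cyclic order: ", max_imbalance(cyclic_order(n, p), n, p))
```

With natural order, the node holding the leading columns runs out of active columns while the others still hold full chunks. With the cyclic order, the counts of active columns never differ by more than one between nodes, which is the sense in which the elimination order, rather than the data allocation, provides the balance.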

Highlights

  • The $O(N^2)$ work is poorly load-balanced in two-dimensional nodal arrays, which we show are optimal with respect to communication for consecutive data allocation, block-cyclic order elimination, and a simple, but fairly general, communications model

  • The main contributions of this paper are: 1) empirical evidence that a block-cyclic order elimination can be used effectively on distributed memory architectures to achieve load-balance as an alternative to block-cyclic data allocation, 2) a discussion of the issues that arise when the block-cyclic orderings of rows and columns are different, which is the typical case when the number of processing nodes is not a square, and 3) a proof that within a wide class of regular data layouts, two-dimensional nodal arrays with consecutive data allocation and cyclic elimination order are optimal for elimination based dense linear algebra routines

  • The peak performance for the global factorization routines is about two thirds of the peak performance of the local level-2 BLAS routines used for the $O(N^3)$ work in the factorization
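The blocking idea behind the local BLAS work mentioned above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, matrices are lists of lists, and integer entries are used so the comparison is exact. It shows that applying k rank-1 updates one after another gives the same result as a single rank-k update, while the blocked form reads and writes each matrix element only once.

```python
import copy


def rank1_update(a, x, y):
    """In place: A <- A - x y^T (the trailing update of one elimination step)."""
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            a[i][j] -= xi * yj


def rank_k_update(a, xs, ys):
    """In place: A <- A - sum_k x_k y_k^T, touching each element of A once."""
    for i in range(len(a)):
        for j in range(len(a[0])):
            a[i][j] -= sum(x[i] * y[j] for x, y in zip(xs, ys))


if __name__ == "__main__":
    a = [[10, 20, 30], [40, 50, 60]]
    xs = [[1, 2], [3, 4]]        # two column vectors of length 2
    ys = [[1, 0, 2], [0, 1, 1]]  # two column vectors of length 3

    unblocked = copy.deepcopy(a)
    for x, y in zip(xs, ys):
        rank1_update(unblocked, x, y)

    blocked = copy.deepcopy(a)
    rank_k_update(blocked, xs, ys)

    print(unblocked == blocked)
```

The arithmetic is identical in both forms; the difference is in memory traffic, since the unblocked version sweeps the whole matrix once per vector pair. That reuse of loaded data is the usual reason higher-rank updates run faster on a node than repeated rank-1 updates.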

Summary

Introduction

The main contributions of this paper are: 1) empirical evidence that a block-cyclic order elimination can be used effectively on distributed memory architectures to achieve load-balance as an alternative to block-cyclic data allocation, 2) a discussion of the issues that arise when the block-cyclic orderings of rows and columns are different, which is the typical case when the number of processing nodes is not a square, and 3) a proof that within a wide class of regular data layouts, two-dimensional nodal arrays with consecutive (block) data allocation and cyclic elimination order are optimal for elimination based dense linear algebra routines. The effectiveness of the block-cyclic order elimination demonstrates the utility of

Blocking for improved performance of local BLAS
Standard data layouts
Cyclic order factorization and triangular system solution
Balanced work load
Rectangular arrays of processing nodes
QR factorization and system solution
Performance
Measurements
Communication
Arithmetic efficiency
Detailed performance analysis
Scalability
Optimal layouts
Findings
Summary
