Abstract

The LU factorization of a matrix A is a widely used algorithm, for instance in the solution of linear systems Ax = b. The increasing capacity of high-performance computers makes it possible to apply direct methods to large, dense systems. To build portable and efficient LU codes for vector and parallel computers, the method is rewritten in block versions, and BLAS (Basic Linear Algebra Subprograms) kernels are used to mask architectural details and obtain good performance, as in the LAPACK (Linear Algebra PACKage) library. Earlier work cited in the references showed that this strategy yields portable and efficient codes when tuned BLAS kernels are available. After a short description of the block versions, we present results obtained on the VAX 6520/2VP, comparing the block algorithm with the point algorithm and vectorized versions with scalar versions. The three columnwise versions of the block algorithm show similar performance on this computer for large matrix dimensions. The block size is a crucial parameter for these algorithms, and the results show that the best performance for large matrices is obtained with a block size of 64, which is the vector register size of the machine used.
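To make the idea of a block version concrete, the following is a minimal sketch of a right-looking block LU factorization without pivoting, written in Python with NumPy. It is not the paper's implementation: the block size `nb`, the helper `lu_unblocked`, and the use of NumPy in place of tuned BLAS calls are illustrative assumptions. It shows how the bulk of the work moves from rank-1 updates in the point algorithm to matrix-matrix operations (triangular solves and one large multiply per step) in the block algorithm.

```python
# Sketch of a right-looking block LU factorization (no pivoting).
# Assumptions: A is a square float array, nb is an illustrative block size;
# a tuned code would replace the solves and the multiply with TRSM and GEMM.
import numpy as np

def lu_unblocked(A):
    """Point (unblocked) in-place LU of a small square block, no pivoting."""
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                                # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])    # rank-1 update
    return A

def lu_blocked(A, nb=64):
    """Right-looking block LU in place; L unit lower triangular, U upper."""
    n = A.shape[0]
    for j in range(0, n, nb):
        b = min(nb, n - j)
        # Factor the diagonal block with the point algorithm.
        lu_unblocked(A[j:j+b, j:j+b])
        if j + b < n:
            L11 = np.tril(A[j:j+b, j:j+b], -1) + np.eye(b)
            U11 = np.triu(A[j:j+b, j:j+b])
            # Triangular solves for the block row of U and block column of L.
            A[j:j+b, j+b:] = np.linalg.solve(L11, A[j:j+b, j+b:])
            A[j+b:, j:j+b] = np.linalg.solve(U11.T, A[j+b:, j:j+b].T).T
            # Trailing-submatrix update: one large matrix-matrix product.
            A[j+b:, j+b:] -= A[j+b:, j:j+b] @ A[j:j+b, j+b:]
    return A
```

In this formulation almost all floating-point work is in the trailing-submatrix update, which is why the choice of block size (64 in the results above) and the quality of the underlying matrix-multiply kernel dominate performance.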
