Abstract

We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200. The routines, collectively called LBLAS, have interfaces consistent with languages with an array syntax, such as Fortran 90. One novel feature, important for distributed-memory architectures, is the capability of performing computations on multiple instances of objects in a single call. The number of instances and their allocation across memory units, and the strides for the different axes within the local memories, are derived from an array descriptor that contains type, shape, and data distribution information. Another novel feature of the LBLAS is a selection of loop order for rank-1 updates and matrix-matrix multiplication based on array shapes, strides, and DRAM page faults. The peak efficiencies for the routines are in excess of 75%. Matrix-vector multiplication achieves a peak efficiency of 92%. The optimization of loop ordering has a success rate exceeding 99.8% for matrices for which the sum of the lengths of the axes is at most 60. The success rate is even higher over the set of all possible matrix shapes. The performance loss when a nonoptimal choice is made is less than ∼15% of peak and typically less than 1% of peak. We also show that the performance gain for high-rank updates may be as much as a factor of 6 over rank-1 updates.
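The multiple-instance feature described above can be illustrated with a small sketch. The descriptor fields and function names below are invented for illustration; the paper's actual descriptor also encodes data distribution across memory units, which this single-node sketch omits.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an LBLAS-style array descriptor. The real
# descriptor also carries type and data-distribution information; only
# shape, stride, and instance count are modeled here.
@dataclass
class ArrayDescriptor:
    shape: tuple      # extents of the local axes
    strides: tuple    # strides (in elements) within local memory
    instances: int    # number of independent problem instances

def multi_instance_dot(desc, x, y):
    """Compute one inner product per instance from flat local storage,
    so a single call services all instances (assumes instances are laid
    out contiguously, one after another)."""
    n = desc.shape[0]
    stride = desc.strides[0]
    results = []
    for inst in range(desc.instances):
        base = inst * n * stride
        s = 0.0
        for i in range(n):
            s += x[base + i * stride] * y[base + i * stride]
        results.append(s)
    return results

# Two instances of a length-3 inner product in a single call
desc = ArrayDescriptor(shape=(3,), strides=(1,), instances=2)
x = [1.0, 2.0, 3.0, 1.0, 1.0, 1.0]
y = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
print(multi_instance_dot(desc, x, y))  # [6.0, 6.0]
```

A single descriptor-driven call over many instances amortizes call overhead and lets the library choose one loop order for the whole batch.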

Highlights

  • The Basic Linear Algebra Subroutines [1, 2, 8] (BLAS) are used in many scientific codes, often being critical for the performance of those codes

  • We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200

  • Note that for matrix-matrix multiplication, an algorithm based on the level-1 BLAS is memory-bandwidth limited when there is a single floating-point unit for each data path to memory and the data paths internal to the floating-point processor and the paths to memory are of the same width
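The bandwidth argument in the last bullet can be made concrete by counting flops per memory reference. The accounting below is a simplified illustration (it ignores loads of C and register/cache effects), not the paper's own model.

```python
def level1_matmul_flops_per_load(n):
    """C = A*B computed as n**2 inner products (level-1 formulation):
    every multiply-add loads one element of A and one of B, so the
    flop:load ratio is fixed at 1 regardless of n."""
    flops = 2 * n**3   # n**2 dot products, 2n flops each
    loads = 2 * n**3   # a[i,k] and b[k,j] fetched on every iteration
    return flops / loads

def blocked_matmul_flops_per_load(n, b):
    """With b-by-b blocking (level-3 style), each loaded block of A and B
    is reused b times, cutting loads by roughly a factor of b."""
    flops = 2 * n**3
    loads = 2 * n**3 / b
    return flops / loads

print(level1_matmul_flops_per_load(64))      # 1.0 flop per load
print(blocked_matmul_flops_per_load(64, 8))  # 8.0 flops per load
```

With one floating-point unit per data path to memory and equal path widths, a ratio of 1 flop per load means the memory system, not the arithmetic unit, sets the pace; blocking raises the ratio and removes that bottleneck.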


Introduction

The Basic Linear Algebra Subroutines [1, 2, 8] (BLAS) are used in many scientific codes, often being critical for the performance of those codes. The level-3 BLAS [1] allow computations, such as matrix-matrix multiplication, to be performed with less demand on the memory bandwidth than when level-2 BLAS routines are used in the absence of interprocedural analysis and subsequent optimization of memory references by a compiler. When the data motion issues for mixing arrays of different rank are satisfactorily solved for the DBLAS, the CMSSL LBLAS will be extended to the exact same functionality as the corresponding routines in the conventional BLAS. Whenever there is a need to stress that the discussion refers to the LBLAS, we prefix the BLAS names with CMSSL, such as, for instance, CMSSL DDOT for the CMSSL routine computing inner products in double precision.
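The same bandwidth reasoning explains the abstract's observed gain for high-rank updates over rank-1 updates. A rough traffic count (illustrative only; the formulas below are a simplified model, not the paper's) compares k successive rank-1 updates of A against a single rank-k update:

```python
def rank1_update_traffic(m, n, k):
    """k separate level-2 rank-1 updates A += x*y^T: all m*n elements of A
    are read and written back on every one of the k updates."""
    return k * (2 * m * n + m + n)

def rankk_update_traffic(m, n, k):
    """One level-3 rank-k update A += X*Y^T: each element of A is kept in a
    register while all k terms are accumulated, so A traverses memory once."""
    return 2 * m * n + k * (m + n)

m = n = 256
k = 8
ratio = rank1_update_traffic(m, n, k) / rankk_update_traffic(m, n, k)
print(f"memory traffic ratio, rank-1 vs rank-{k}: {ratio:.1f}x")
```

Both formulations perform the same 2*m*n*k flops, so for a bandwidth-limited node the traffic ratio bounds the achievable speedup; this is consistent in spirit with the factor-of-6 gain for high-rank updates reported in the abstract.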

