Adaptive blocking in the QR factorization

Christian H Bischof

doi:10.1007/bf00127828

Abstract

On most high-performance architectures, data movement is slow compared to floating point (in particular, vector) performance. On these architectures block algorithms have been successful for matrix computations. By considering a matrix as a collection of submatrices (the so-called blocks), one naturally arrives at algorithms that require little data movement. The optimal blocking strategy, however, depends on the computing environment and on the problem parameters. On parallel machines, tradeoffs between individual floating point performance and overall system performance also come into play. Current approaches use fixed-width blocking strategies which are not optimal. This paper presents an adaptive blocking methodology for determining a good blocking strategy systematically. We demonstrate this technique on a block QR factorization routine on a distributed-memory machine. Using timing models for the high-level kernels of the algorithm, we can formulate in a recurrence relation a blocking strategy that avoids adding extra delays along the critical path of the algorithm. This recurrence relation predicts performance well since we base our timing models on observed data, not other simplistic measures. Experiments on the Intel iPSC/1 hypercube show that, in fact, the resulting blocking strategy is as good as any fixed-width blocking strategy, independent of problem size and the number of processors employed. So while we do not know the optimum fixed-width blocking strategy unless we rerun the same problem several times, adaptive blocking provides close to optimum performance in the first run. We also mention how adaptive blocking can result in performance portable code by automating the generation of the timing models.

Full Text