Efficient implementation of QR decomposition on Intel multi-core processors

Mostafa I Soliman

doi:10.1109/icenco.2011.6153928

Abstract

This paper shows how to speed up the QR decomposition algorithm on Intel multi-core processors by exploiting explicit parallelism and the memory hierarchy. Streaming SIMD Extensions and multithreaded computation on multiple cores are used to exploit data-level parallelism (DLP) and thread-level parallelism (TLP), respectively. In addition, the memory hierarchy is exploited by performing the QR computation on blocks of data, which reduces the impact of memory latency by reusing data already loaded into the cache memories. Three processors are evaluated: a Core 2 Duo E7500 with two cores (2 physical/2 logical processors), a Core i5 M520 with two cores supporting Hyper-Threading Technology (2 physical/4 logical processors), and a Xeon E5410 with four cores (4 physical/4 logical processors). On matrices from 1000×1000 up to 3000×3000 in steps of 100, the average speedups of the multithreaded SIMD implementation of the block QR decomposition over the non-parallel execution are about 6.6, 9.6, and 11.3 on these three processors, respectively. On reasonably large matrices of size 2000×2000 (4000×4000), our experimental results show that the use of Intel Streaming SIMD Extensions, multithreading, SIMD multithreading, matrix blocking, blocking SIMD, blocking multithreading, and blocking SIMD multithreading speeds up QR decomposition on the Core 2 Duo E7500 by factors of about 2.1 (2.1), 1.8 (1.8), 2.2 (2.2), 1.7 (1.7), 5.6 (5.6), 2.7 (2.6), and 6.6 (6.3), on the Core i5 M520 by factors of about 3.7 (3.6), 2.2 (2.6), 3.8 (4), 1.9 (1.9), 7.9 (7.8), 2.9 (3), and 9.6 (10.7), and on the Xeon E5410 by factors of about 2.6 (2.3), 3.2 (2.8), 4.7 (3), 1.5 (1.5), 5.4 (4.9), 5 (5.1), and 12.1 (7), respectively.
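As a rough illustration of the computation being accelerated, the following Python sketch computes a QR decomposition with Householder reflections. The abstract does not say which QR algorithm the paper uses, so Householder reflections are an assumption here, and the names `householder_qr` and `matmul` are hypothetical. The sketch is deliberately unblocked, scalar, and single-threaded: it is only a baseline whose inner dot-product and update loops are the kind of work that SIMD instructions, threads, and cache blocking would target, not a reproduction of the paper's implementation.

```python
import math


def householder_qr(a):
    """Return (q, r) such that a = q * r.

    a is an m x n matrix given as a list of rows (m >= n assumed).
    q is m x m orthogonal, r is m x n upper triangular.
    """
    m, n = len(a), len(a[0])
    r = [[float(x) for x in row] for row in a]  # working copy, becomes R
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

    for k in range(min(m - 1, n)):
        # Householder vector v built from column k, rows k..m-1.
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        alpha = -math.copysign(norm_x, x[0])  # sign chosen to avoid cancellation
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)

        # R <- H * R with H = I - 2 v v^T / (v^T v), applied column by column.
        # Each column needs one dot product and one scaled-vector update.
        for j in range(k, n):
            s = 2.0 * sum(v[i - k] * r[i][j] for i in range(k, m)) / vtv
            for i in range(k, m):
                r[i][j] -= s * v[i - k]

        # Q <- Q * H, applied row by row, so that Q = H_1 H_2 ... H_k.
        for i in range(m):
            s = 2.0 * sum(q[i][j] * v[j - k] for j in range(k, m)) / vtv
            for j in range(k, m):
                q[i][j] -= s * v[j - k]

    return q, r


def matmul(x, y):
    """Plain triple-loop matrix product, used only to check the result."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


if __name__ == "__main__":
    a = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
    q, r = householder_qr(a)
    qr = matmul(q, r)
    err = max(abs(qr[i][j] - a[i][j]) for i in range(3) for j in range(3))
    print("max reconstruction error:", err)
```

In this sketch the loops over `i` inside the `R <- H * R` step run over contiguous vector elements, which is where data-level parallelism would apply, while the independent columns `j` could be divided among threads. Processing the columns in groups that fit in cache would correspond to the blocking idea described in the abstract.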

Full Text