Designing and dynamically load balancing hybrid LU for multi/many-core

Michael Deisher,Pradeep Dubey,Mikhail Smelyanskiy,Brian Nickerson,Michael Chuvelev,Victor W Lee

doi:10.1007/s00450-011-0169-x

Abstract

Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly-tuned BLAS subroutines, hiding communication latency and balancing the load across devices of variable processing capabilities. In this paper we show how single-precision LU factorization is accelerated on Intel® MIC(Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor and Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on Intel's first implementation of Intel MIC architecture [codenamed Knight's Ferry (KNF)] silicon platform. Our implementation takes full advantage of multiple levels of memory hierarchy on MIC, and successfully utilizes up to 80% of its peak compute capability. Our LU factorization performance exceeds 570 Gflop/s including matrix transfer overhead when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementations is dynamic resource partitioning to improve load balance across the entire system.

Full Text