Abstract

In this paper we investigate different multithreading programming paradigms on x86 CPU architectures, studying the recently released Intel Xeon Phi coprocessor and commonly used Intel Xeon processors, as well as the NVIDIA K20 GPU, which represents the cutting edge of general-purpose graphics processing units. The numerical algorithm selected to address the problem is the power method, which is widely used to compute the dominant eigenvalue of a matrix; this work focuses on dense linear algebra. The multi-core and many-core parallelization techniques considered include OpenMP, Intel Cilk Plus, and Intel Threading Building Blocks (TBB), along with optimized computing libraries such as the Intel Math Kernel Library (MKL) and the NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library. Optimized implementations of these techniques were applied separately to each of the aforementioned architectures. Because a single programming model may not satisfy the growing performance demand, we also explored possible combinations of these languages. The study shows that a hybrid pattern of multithreading and data parallelism via explicit vectorization maximizes performance on x86 architectures, allowing us to obtain 80% of the sustainable peak performance in double precision on the Intel Many Integrated Core (MIC) architecture; in single precision this figure reaches 96%. In addition, this approach delivers reasonable performance while requiring the least development time. The number of iterations until convergence is roughly the same on the CPU and GPU architectures. The GPU performs better for small matrix sizes, whereas the Intel Xeon Phi coprocessor excels for large sizes with better scalability.
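For context, the sketch below shows a minimal power-method iteration for a dense matrix, parallelized with OpenMP in the spirit of the multithreading techniques named above. The row-major layout, the 2-norm normalization, the tolerance-based stopping criterion, and the helper names `matvec` and `power_method` are illustrative assumptions; they do not reproduce the paper's actual implementation or its vectorized MIC/GPU variants.

```c
/*
 * Minimal sketch of the power method for the dominant eigenvalue of a dense
 * n x n matrix, parallelized with OpenMP.  Layout, tolerance handling, and
 * helper names are illustrative assumptions, not the paper's implementation.
 */
#include <math.h>
#include <stdlib.h>
#include <omp.h>

/* y = A*x for a dense row-major matrix; rows are distributed across threads. */
static void matvec(int n, const double *A, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Iterates x <- A*x / ||A*x|| starting from a nonzero vector x and returns
 * the dominant-eigenvalue estimate once successive estimates differ by less
 * than tol or max_iter iterations have been performed. */
double power_method(int n, const double *A, double *x, int max_iter, double tol)
{
    double *y = malloc((size_t)n * sizeof(double));
    double lambda = 0.0;

    for (int it = 0; it < max_iter; it++) {
        matvec(n, A, x, y);

        /* 2-norm of y, accumulated with an OpenMP reduction. */
        double norm = 0.0;
        #pragma omp parallel for reduction(+:norm)
        for (int i = 0; i < n; i++)
            norm += y[i] * y[i];
        norm = sqrt(norm);

        /* Normalize to obtain the next iterate; the norm converges to the
         * magnitude of the dominant eigenvalue. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] = y[i] / norm;

        if (fabs(norm - lambda) < tol) {
            lambda = norm;
            break;
        }
        lambda = norm;
    }

    free(y);
    return lambda;
}
```

In a production variant the dense matrix-vector product would typically be replaced by an optimized BLAS call (e.g. MKL or cuBLAS), with the surrounding threading model supplying the normalization and convergence test.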
