Abstract

Accelerators such as GPUs have become popular general-purpose computing devices in the field of high-performance computing. As storage and compute capabilities grow, solving complex scientific and engineering problems on CPU–GPU heterogeneous systems has become increasingly important in the big data era. Compute-intensive problems have already been solved successfully with CPU–GPU cooperative computing. However, large-scale data-intensive problems remain difficult to handle, especially those limited by GPU device memory. In this paper, the dual buffer rotation four-stage pipeline (DBFP) mechanism is proposed for CPU–GPU cooperative computation to efficiently handle data-intensive problems that require more memory than a single GPU provides. A data-block-partition-based pipeline computing strategy is designed on top of the DBFP mechanism. On the one hand, it breaks through the bottleneck of limited GPU device memory; on the other hand, it exploits the high performance of both CPU and GPU by overlapping data transfer with computation. Furthermore, the DBFP mechanism is easy to extend to heterogeneous systems equipped with multiple GPUs while achieving high resource utilization. The results show that it achieves 99% and 90% of theoretical performance for dense general matrix multiplication on one GPU and two GPUs, respectively, with Nvidia GTX480 or K40 GPUs. It also enables the K-means and T-nearest-neighbor algorithms to process larger datasets that were previously limited by GPU device memory. We achieve nearly 1.9-fold performance gains through dynamic task scheduling on two GPUs when the performance bottleneck is GPU computing or data transmission.
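
The core idea of streaming data blocks through rotating device buffers so that transfers overlap with kernel execution can be illustrated with standard CUDA streams. The sketch below is only a minimal illustration under assumed names (processChunk, CHUNK, NUM_CHUNKS), not the authors' DBFP implementation; in particular, the paper's four-stage pipeline also involves CPU-side computation, which is omitted here.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical per-block kernel, standing in for the real GPU stage of the pipeline.
__global__ void processChunk(float *d_buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_buf[i] *= 2.0f;
}

int main(void) {
    const int CHUNK = 1 << 20;            // elements per data block
    const int NUM_CHUNKS = 8;             // total blocks streamed through the GPU
    const size_t BYTES = CHUNK * sizeof(float);

    // Pinned host buffer holding the full dataset (which may exceed GPU memory).
    float *h_data;
    cudaMallocHost(&h_data, (size_t)NUM_CHUNKS * BYTES);
    for (int i = 0; i < NUM_CHUNKS * CHUNK; ++i) h_data[i] = 1.0f;

    // Two device buffers rotated between successive chunks (the "dual buffer" idea).
    float *d_buf[2];
    cudaMalloc(&d_buf[0], BYTES);
    cudaMalloc(&d_buf[1], BYTES);

    // Two streams so copy-in, compute, and copy-out of adjacent chunks can overlap.
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int c = 0; c < NUM_CHUNKS; ++c) {
        int s = c & 1;                    // alternate buffer/stream each iteration
        float *h_chunk = h_data + (size_t)c * CHUNK;

        // Stage 1: host-to-device transfer of the next data block.
        cudaMemcpyAsync(d_buf[s], h_chunk, BYTES, cudaMemcpyHostToDevice, stream[s]);
        // Stage 2: GPU computation on that block.
        processChunk<<<(CHUNK + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], CHUNK);
        // Stage 3: device-to-host transfer of the result.
        cudaMemcpyAsync(h_chunk, d_buf[s], BYTES, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    printf("first element after processing: %f\n", h_data[0]);

    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
    cudaFreeHost(h_data);
    return 0;
}
```

Because the two streams alternate, the copy engines can move chunk c+1 while the compute engine works on chunk c, so GPU memory usage stays bounded at two block-sized buffers regardless of the total dataset size.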
