Summary

Heterogeneous devices are becoming essential components of high-performance computing infrastructures, and the graphics processing unit (GPU) plays an important role in this landscape. Given a problem, the established approach to exploiting the GPU is to design parallel solutions free of data dependencies, which are then offloaded to the GPU's massively parallel hardware. This design principle often yields applications that cannot fully utilize the GPU. The goal of this article is to challenge this common belief by showing empirically that allowing even simple forms of synchronization lets programmers design solutions that admit conflicts and achieve better performance. Our experience shows that lock-based solutions to the k-means clustering problem, implemented with two well-known locking strategies, outperform the well-engineered, parallel KMCUDA on both synthetic and real datasets: averaged across all locking algorithms, runtimes are 8× faster on a synthetic dataset and 1.7× faster on a real-world dataset (with maximum speedups of 71.3× and 2.75×, respectively). We validate these results with a more sophisticated clustering algorithm, namely fuzzy c-means, and summarize our findings in three guidelines for making concurrency effective when programming GPU applications.