Abstract

Heterogeneous devices are becoming necessary components of high performance computing infrastructures, and the graphics processing unit (GPU) plays an important role in this landscape. Given a problem, the established approach for exploiting the GPU is to design solutions that are parallel and free of data dependencies, and then to offload them to the GPU's massively parallel hardware. This design principle often leads to applications that cannot maximize GPU hardware utilization. The goal of this article is to challenge this common belief by empirically showing that allowing even simple forms of synchronization enables programmers to design solutions that admit conflicts and achieve better performance. Our experience shows that lock‐based solutions to the k‐means clustering problem, implemented using two well‐known locking strategies, outperform the well‐engineered and parallel KMCUDA on both synthetic and real datasets. Averaged across all locking algorithms, runtimes are 8× faster on a synthetic dataset and 1.7× faster on a real-world dataset, with maximum speedups of 71.3× and 2.75×, respectively. We validate these results using a more sophisticated clustering algorithm, namely fuzzy c‐means, and summarize our findings by identifying three guidelines to help make concurrency effective when programming GPU applications.
