Abstract

Although there are many efficient sorting algorithms and implementations for graphics processing units (GPUs), none of them are both comparison-based and in-place. The sorting algorithm presented in this chapter, an implementation of bitonic sort for NVIDIA's GPUs, is both. Although its time complexity is O(n log² n), bitonic sort is a widely used parallel sorting algorithm because it is based on a sorting network and can therefore be parallelized efficiently. Processing a sorting network in parallel requires a mechanism for synchronization and communication between parallel processing units; in CUDA, those units are typically implemented as CUDA threads. In general, synchronization between arbitrary threads is not possible in CUDA, so to enforce a specific order of tasks, the tasks have to be executed in consecutive kernel launches. "Communication" among consecutive kernel launches is achieved by writing to (persistent) global GPU memory. The chapter focuses on two main aspects of implementing bitonic sort for NVIDIA's GPUs: reducing the communication and synchronization induced by bitonic sort, and making extensive use of shared memory together with efficient intra-block synchronization. Both reduce the number of kernel launches and global memory accesses.
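To make the structure of the sorting network concrete, the following is a minimal sequential sketch of bitonic sort in Python, not the authors' CUDA implementation. Each inner pass (one `(k, j)` pair) performs only independent compare-exchange operations, which is exactly why a pass can run in parallel; on a GPU, each such pass would typically correspond to one kernel launch (or, as the chapter optimizes, to a step within shared memory).

```python
def bitonic_sort(a):
    """In-place, comparison-based bitonic sort.
    len(a) must be a power of two."""
    n = len(a)
    k = 2
    while k <= n:          # stage: size of bitonic sequences being merged
        j = k // 2
        while j >= 1:      # pass: compare-exchange distance (one "kernel launch")
            for i in range(n):
                partner = i ^ j          # partner index differs in bit j
                if partner > i:          # each pair is handled exactly once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Because no element participates in more than one compare-exchange per pass, all iterations of the inner `for` loop are independent and can be mapped one-to-one onto CUDA threads; the sequential pass boundary plays the role of the global synchronization that consecutive kernel launches provide.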

