Abstract
String sorting is a fundamental kernel of string matching and database index construction, yet it has not been studied as extensively as fixed-length key sorting. Because processing variable-length keys in hardware is challenging, it is no surprise that no hardware-accelerated string sorters have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade CPU. Our pHS5 extends pS5, the state-of-the-art string sorting algorithm for multi-core shared-memory CPUs, by adding multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable of the dominant kernels of pS5 by up to 33% compared to a single Intel Xeon Broadwell core, despite a clock frequency that is 17 times lower. Furthermore, we extend the job scheduling mechanism of pS5 to schedule the accelerable kernel not only among the available CPU cores but also on our PEs, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. Overall, we accelerate the entire algorithm by up to 10% with respect to the 28-thread software baseline running on the Xeon processor and by up to 36% at lower thread counts. Finally, we generalize our results, taking pS5 as representative of software that is heavily optimized for modern multi-core CPUs, and investigate the performance and energy advantage that an FPGA in a datacenter setting can offer to regular RTL users compared to additional CPU cores.
Highlights
Sorting is one of the most studied problems in computer science [1] and a fundamental building block of countless algorithms and applications [2,3,4,5]
We present Parallel Hybrid S5 (pHS5), to our knowledge the first hardware-accelerated string sorter, implemented on the Intel HARPv2 CPU-FPGA heterogeneous system
One of our processing elements accelerates one of the dominant kernels in Parallel S5 (pS5) by up to 33% compared to a single Xeon core, and 6 PEs accelerate the entire application by up to 10% compared to pS5 running in its fully parallel version on a 14-core Xeon CPU
Summary
Sorting is one of the most studied problems in computer science [1] and a fundamental building block of countless algorithms and applications [2,3,4,5]. For its simplest and most common form, fixed-length key (e.g., integer) sorting, a plethora of highly optimized parallel implementations have been proposed on multiple compute platforms: CPUs [6], GPUs [7, 8], and FPGAs [9,10,11]. By comparison, far fewer solutions have been proposed for sorting variable-length strings lexicographically. String sorting is a building block of suffix sorting, which is used in string matching, and of database index construction [13]. Although parallel string sorting algorithms have been proposed for CPUs [14] and GPUs [15], to the best of our knowledge no hardware accelerator for this problem has been made available yet. Handling variable-length keys in hardware is challenging per se, and it involves key comparisons that can become expensive because keys can be arbitrarily long.
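The cost of comparing long keys can be illustrated with a minimal Python sketch. It is not taken from the paper, and `compare_counting` is a hypothetical helper introduced only for illustration. It compares two byte strings lexicographically and reports how many character positions it had to inspect, which grows with the length of the common prefix rather than staying constant as it would for a fixed-length integer key.

```python
def compare_counting(a: bytes, b: bytes) -> tuple[int, int]:
    """Compare a and b lexicographically.

    Returns (sign, inspected): sign is -1, 0, or 1, and inspected is the
    number of character positions examined before the result was known.
    Hypothetical helper for illustration only.
    """
    inspected = 0
    for x, y in zip(a, b):
        inspected += 1
        if x != y:
            # First differing character decides the order.
            return (-1 if x < y else 1), inspected
    # One string is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b)), inspected


if __name__ == "__main__":
    print(compare_counting(b"apple", b"apricot"))
    # Two keys that share a 1000-character prefix need 1001 inspections.
    print(compare_counting(b"x" * 1000 + b"a", b"x" * 1000 + b"b"))
```

Under this model, a comparison-based sort over keys with long shared prefixes repeats much of this work on every comparison, which is one reason string-specific sorting algorithms are studied separately from fixed-length key sorting.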