Abstract
Sorting is a key computational kernel in many big data applications. Most sorting implementations focus on a specific input size, record width, and hardware configuration. This has created a wide array of sorters that are optimized only to a narrow application domain. In this work we show that merge trees can be implemented on FPGAs to offer state-of-the-art performance over many problem sizes. We introduce a novel merge tree architecture and develop Bonsai, an adaptive sorting solution that takes into consideration the off-chip memory bandwidth and the amount of on-chip resources to optimize sorting time. FPGA programmability allows us to leverage Bonsai to quickly implement the optimal merge tree configuration for any problem size and memory hierarchy. Using Bonsai, we develop a state-of-the-art sorter which specifically targets DRAM-scale sorting on AWS EC2 F1 instances. For 4-32 GB array size, our implementation has a minimum of 2.3x, 1.3x, 1.2x and up to 2.5x, 3.7x, 1.3x speedup over the best designs on CPUs, FPGAs, and GPUs, respectively. Our design exhibits 3.3x better bandwidth-efficiency compared to the best previous sorting implementations. Finally, we demonstrate that Bonsai can tune our design over a wide range of problem sizes (megabyte to terabyte) and memory hierarchies including DDR DRAMs, high-bandwidth memories (HBMs) and solid-state disks (SSDs).
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have