Abstract

Accelerating sorting using dedicated hardware to fully utilize the memory bandwidth for Big Data applications has gained much interest in the research community. Recently, parallel sorting networks have been widely employed in hardware implementations due to their high data parallelism and low control overhead. In this paper, we propose a systematic methodology for mapping large-scale bitonic sorting networks onto FPGAs. To realize data permutations in the sorting network, we develop a novel RAM-based design by vertically “folding” the classic Clos network. By utilizing the proposed design for data permutation, we develop a hardware generator to automatically build bitonic sorting architectures on FPGAs. For a given input size, data width, and degree of data parallelism, the hardware generator specializes both the datapath and the control unit for sorting and generates a design in a high-level hardware description language. We demonstrate trade-offs among throughput, latency, and area using two illustrative sorting designs: a high-throughput design and a resource-efficient design. With a data parallelism of p (2 ≤ p ≤ N/2), the high-throughput design sorts an N-key sequence with latency 6N/p + o(N), delivers a throughput of p results per cycle, and uses 6N + o(N) words of memory. This achieves optimal memory efficiency (defined as the ratio of throughput to the amount of on-chip memory used by the design) and outperforms the state-of-the-art. Experimental results show that the designs obtained by our proposed hardware generator achieve 49 to 112 percent improvement in energy efficiency and 56 to 430 percent higher memory efficiency compared with the state-of-the-art.
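For readers unfamiliar with the underlying algorithm, the following is a minimal software reference model of a bitonic sorting network (the classic Batcher construction, assuming a power-of-two input size). The paper's contribution is the FPGA mapping of this network, not the algorithm itself; each compare-exchange below corresponds to one comparator in the hardware network.

```python
def bitonic_sort(keys, ascending=True):
    """Sort a power-of-two-length list with Batcher's bitonic network."""
    n = len(keys)
    if n <= 1:
        return list(keys)
    half = n // 2
    # Build a bitonic sequence: ascending first half, descending second half.
    first = bitonic_sort(keys[:half], True)
    second = bitonic_sort(keys[half:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(keys, ascending):
    """Merge a bitonic sequence into a sorted one via compare-exchange stages."""
    n = len(keys)
    if n <= 1:
        return list(keys)
    half = n // 2
    arr = list(keys)
    for i in range(half):
        # One comparator: swap so the pair agrees with the sort direction.
        if (arr[i] > arr[i + half]) == ascending:
            arr[i], arr[i + half] = arr[i + half], arr[i]
    return (bitonic_merge(arr[:half], ascending) +
            bitonic_merge(arr[half:], ascending))
```

The data-independent comparator pattern (each stage compares fixed index pairs, regardless of the input values) is what gives the network its low control overhead and makes it amenable to the folded, RAM-based hardware realization described in the abstract.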
