Data staging for efficient high throughput stream processing

Thaddeus Koehn, Peter Athanas

doi:10.1016/j.parco.2019.102566

Abstract

High-bandwidth stream-oriented applications often demand high-throughput computation engines implemented on dedicated hardware such as FPGAs or ASICs. In such circuits, the streaming width (the number of inputs and outputs per cycle) multiplied by the clock frequency gives the maximum throughput of the architecture. Re-sequencing the data elements in both space and time (referred to here as permutations of the input/output sequence) can often lead to architectures that are better suited to the computation. Structures for performing permutations on streaming data can be grouped into two classes: general permutations and linear permutations. The subclass of linear permutations includes important permutations such as the perfect shuffle (stride) permutation, Hadamard reordering, and bit-reversal, but it requires the stream width to be a power of two. This article reviews the two implementation types and seeks to determine in which cases the general implementation type should be used. For power-of-two streaming widths, the linear implementation shows optimal or near-optimal costs in area (RAM and logic gates), but for arbitrary streaming widths, the linear implementation must be built at the next larger power of two. In these cases, the overhead of the general implementation may cost less than the wider stream width required by the linear implementation.
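As an illustration of the quantities named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual reading of a linear permutation as one whose index mapping is a linear function of the index bits, which is why it only applies to power-of-two sizes. The function names (`max_throughput`, `next_power_of_two`, `bit_reversal`, `perfect_shuffle`) and the example figures (width 8, 200 MHz, width 5) are hypothetical and chosen only for this example.

```python
def max_throughput(stream_width: int, clock_hz: float) -> float:
    """Peak elements per second: streaming width times clock frequency."""
    return stream_width * clock_hz


def next_power_of_two(width: int) -> int:
    """Smallest power of two that is >= width (width must be positive)."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return 1 << (width - 1).bit_length()


def _reverse_bits(index: int, n_bits: int) -> int:
    """Reverse the n_bits-wide binary representation of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal(n_bits: int) -> list[int]:
    """Source index for each output position of a bit-reversal on 2**n_bits elements."""
    return [_reverse_bits(i, n_bits) for i in range(1 << n_bits)]


def perfect_shuffle(n_bits: int) -> list[int]:
    """Source index for each output position when the two halves are interleaved.

    The mapping is a one-bit rotation of the index, so it is a pure
    rearrangement of index bits (assumes n_bits >= 1).
    """
    if n_bits < 1:
        raise ValueError("n_bits must be at least 1")
    top = n_bits - 1
    return [(i >> 1) | ((i & 1) << top) for i in range(1 << n_bits)]


if __name__ == "__main__":
    # Hypothetical figures: 8 elements per cycle at 200 MHz.
    print(max_throughput(8, 200e6))
    # A width of 5 would force a linear structure of width 8,
    # leaving 3 of the 8 lanes unused.
    print(next_power_of_two(5))
    print(bit_reversal(3))
    print(perfect_shuffle(3))
```

The padding case is the one the abstract singles out: when the required width is not a power of two, the unused lanes of the wider linear structure are the cost that a general permutation structure has to beat.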

Full Text