Abstract

Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interaction with iteratively changing depen-dence distance, reuse rate, and memory access patterns. This makes multi-threading impractical due to fine-grain synchronization, and vectorization ineffective due to the non-rectangular iteration domain. CPUs, DSPs, and GPUs perform order-of-magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first develop a novel execution model, inductive dataflow, where inductive dependence patterns and memory access patterns (streams) are first-order primitives. Second, we develop a hybrid spatial architecture combining systolic and tagged dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally-provisioned DSPs by 4.6×-37×. Compared to state-of-the-art spatial architectures, REVEL is mean 3× faster. Compared to a set of ASICs, REVEL is only 2× the power and half the area.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call