Abstract

Dataflow architectures have proven promising for high-performance computing. However, traditional dataflow architectures are inefficient on typical scientific applications such as stencil computations and FFT because their function units are underutilized. Exploiting the blocking and parallelism characteristics of scientific applications, we design SPU, an efficient dataflow architecture for these workloads. In SPU, dataflow graphs translated from the loop bodies of scientific applications are mapped onto the processing element (PE) array. During execution, iterations enter the dataflow graph in a pipelined fashion, and three levels of parallelism are exploited to improve function-unit utilization: inner-graph parallelism, pipelining parallelism, and inter-graph parallelism. Experimental results show that SPU achieves an average energy efficiency of 25.97 GFLOPS/W in 40 nm technology, and that its floating-point function-unit utilization is on average 2.82x that of a typical dataflow architecture on typical scientific applications.
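
To illustrate why pipelining iterations through a mapped dataflow graph raises function-unit utilization, the following is a minimal toy model, not SPU's actual microarchitecture. It assumes a hypothetical 3-point stencil loop body, one graph node per function unit, unit latency per operation, and one new iteration entering per cycle; the graph, node names, and helper functions are all illustrative assumptions. Only pipelining parallelism is modeled (plus the inner-graph parallelism of the three independent multiplies); inter-graph parallelism is not.

```python
from fractions import Fraction

# Hypothetical loop body: out[i] = c0*a[i-1] + c1*a[i] + c2*a[i+1]
# Each key is a graph node (assumed to occupy its own function unit);
# each value lists the nodes whose results it consumes.
GRAPH = {
    "mul0": [],
    "mul1": [],
    "mul2": [],
    "add0": ["mul0", "mul1"],
    "add1": ["add0", "mul2"],
}


def levels(graph):
    """Earliest cycle, relative to iteration entry, at which each node fires."""
    level = {}

    def visit(node):
        if node not in level:
            # A node fires one cycle after its latest predecessor; sources fire at 0.
            level[node] = 1 + max((visit(p) for p in graph[node]), default=-1)
        return level[node]

    for node in graph:
        visit(node)
    return level


def utilization(graph, iterations, pipelined):
    """Fraction of (function unit, cycle) slots that perform useful work."""
    depth = max(levels(graph).values()) + 1
    if pipelined:
        # One iteration enters per cycle; the last one drains after `depth` cycles.
        cycles = iterations + depth - 1
    else:
        # Each iteration must finish before the next one starts.
        cycles = iterations * depth
    busy = iterations * len(graph)  # every node fires once per iteration
    return Fraction(busy, cycles * len(graph))


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, utilization(GRAPH, n, False), utilization(GRAPH, n, True))
```

Under these assumptions, non-pipelined utilization stays at the reciprocal of the graph depth regardless of the iteration count, while pipelined utilization approaches 1 as the number of iterations grows.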
