Abstract

Parallelizing the memory accesses in a nested loop is a critical challenge in enabling loop pipelining. An effective approach in high-level synthesis for field-programmable gate arrays (FPGAs) is to map these accesses to multiple on-chip memory banks using a memory partitioning technique. In this paper, we propose an efficient memory partitioning algorithm that exploits data reuse to achieve parallel data access with low overhead and low time complexity. We find that in most image and video processing applications, a large amount of data can be reused across different iterations of a loop nest. Motivated by this observation, we propose to cache reusable data in on-chip registers organized as register chains. The nonreusable data are then distributed across several memory banks by a memory partitioning algorithm. We revise the existing padding method to cover cases, occurring frequently in our method, in which certain components of the partition vector are zero. Experimental results demonstrate that, compared with state-of-the-art algorithms, the proposed method is efficient in terms of execution time, resource overhead, and power consumption across a wide range of access patterns extracted from image and video processing applications. For the tested patterns, the execution time is typically less than one millisecond, and the number of required memory banks is reduced by 59.7% on average. This leads to an average reduction of 78.2% in look-up tables, 65.5% in flip-flops, and 37.1% in DSP48Es, and consequently to a 74.8% reduction in dynamic power consumption. Moreover, the storage overhead incurred by the proposed method is zero for most widely used access patterns in image filtering.
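
As a rough illustration of the data-reuse idea described above, the sketch below models a one-dimensional 3-tap sliding window in software. It is not the paper's algorithm: the function names, the 1-D window, and the use of a sum as the loop body are all assumptions made for this example. It only shows why holding previously loaded values in a register chain reduces the number of memory reads per iteration, which in turn reduces the number of banks needed to serve those reads in parallel.

```python
from collections import deque


def stencil_naive(data, taps=3):
    """Hypothetical baseline: every iteration reads all `taps` elements from memory."""
    reads = 0
    out = []
    for i in range(len(data) - taps + 1):
        window = [data[i + k] for k in range(taps)]  # `taps` memory reads
        reads += taps
        out.append(sum(window))
    return out, reads


def stencil_with_register_chain(data, taps=3):
    """Hypothetical reuse version: one new memory read per iteration.

    The deque stands in for a register chain: each new value is shifted in,
    and the oldest value drops out once the chain is full.
    """
    chain = deque(maxlen=taps)
    reads = 0
    out = []
    for x in data:
        reads += 1          # the only memory read in this iteration
        chain.append(x)     # reused values stay in the chain
        if len(chain) == taps:
            out.append(sum(chain))
    return out, reads


data = list(range(8))
naive_out, naive_reads = stencil_naive(data)
chain_out, chain_reads = stencil_with_register_chain(data)
print(naive_out == chain_out, naive_reads, chain_reads)
```

In this toy case both versions compute the same result, but the naive loop issues three reads per iteration while the register-chain version issues one. In a hardware setting, only the remaining nonreusable accesses would then need to be partitioned across memory banks.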
