Abstract
The programmability of OpenCL, combined with the pipelined parallelism of FPGAs, has enabled high-performance and power-efficient solutions for massively parallel applications. This paper provides an exhaustive study of the scalability of OpenCL coarse-grained parallelism, namely Compute Unit (CU) replication, on cloud FPGAs. The work demonstrates that, for many applications, there is an optimum number of CUs that maximizes performance, determined by the memory bandwidth, the memory conflicts introduced by CU replication, and the available FPGA resources. The paper also provides a source-code template and an optimized front-end design tool that explore and identify the optimum CU number for a given application, while hiding the programming and exploration difficulties from programmers. Our experimental results on 15 applications taken from the Xilinx SDAccel v2017.4 suite and the Rodinia Benchmark Suite v3.1 show, on average, a speedup of 6.2X and a bandwidth improvement of 3.5X, with only 1.04X the power and less than 10% resource utilization. In addition, our tool yields a 31% improvement in total design synthesis time for an illustrative Histogram application.
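To illustrate why an optimum CU count can exist, the following is a minimal sketch of a toy analytical model, not the tool or template described in the paper. It assumes throughput is the smaller of a compute bound (growing linearly with the CU count) and a memory bound (shrinking as replication adds conflicts), and that the FPGA resource budget caps the CU count. All function names, parameters, and numeric values (`throughput`, `best_cu_count`, `conflict`, `cu_resource_pct`, and so on) are hypothetical and chosen only for illustration.

```python
def throughput(n_cus, cu_rate, bytes_per_item, peak_bw, conflict):
    """Toy throughput estimate (items/s) for n_cus replicated compute units.

    Assumed model, not taken from the paper:
      - compute bound grows linearly with the number of CUs
      - effective memory bandwidth degrades with each extra CU
        by a hypothetical conflict factor
    """
    compute_bound = n_cus * cu_rate
    effective_bw = peak_bw / (1.0 + conflict * (n_cus - 1))
    memory_bound = effective_bw / bytes_per_item
    return min(compute_bound, memory_bound)


def best_cu_count(cu_rate, bytes_per_item, peak_bw, conflict,
                  cu_resource_pct, budget_pct=100):
    """Return the CU count with the highest modeled throughput.

    cu_resource_pct is the (hypothetical) share of FPGA resources one CU
    uses, in whole percent; budget_pct is the share available to kernels.
    Ties are resolved in favor of the smaller CU count.
    """
    max_cus = budget_pct // cu_resource_pct
    if max_cus < 1:
        raise ValueError("a single CU does not fit in the resource budget")
    candidates = range(1, max_cus + 1)
    # max() returns the first maximal element, so the smallest n wins ties.
    return max(
        candidates,
        key=lambda n: throughput(n, cu_rate, bytes_per_item, peak_bw, conflict),
    )


if __name__ == "__main__":
    # Illustrative numbers only: 1 Gitem/s per CU, 4 bytes per item,
    # 16 GB/s peak bandwidth, 10% conflict penalty per extra CU,
    # each CU occupying 10% of the device.
    params = dict(cu_rate=1.0, bytes_per_item=4, peak_bw=16.0, conflict=0.1)
    for n in range(1, 7):
        print(n, round(throughput(n, **params), 3))
    print("best CU count:", best_cu_count(cu_resource_pct=10, **params))
```

In this model, adding CUs helps while the design is compute-bound and hurts once the conflict-degraded memory bound takes over, so the modeled optimum sits near the crossover unless the resource budget is reached first. A real exploration would replace the model with measured kernel runs.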