Abstract

High Level Synthesis (HLS) tools targeting Field Programmable Gate Arrays (FPGAs) aim to provide a method for programming these devices via high-level abstractions. Initially, HLS support for FPGAs focused on compiling C/C++ to hardware circuits. This raised the issue of determining the programming practices which resulted in the best performing circuits. Recently, to further increase the applicability of HLS approaches, renewed effort was placed on support for HLS of OpenCL code for FPGA, raising the same issues of coding practices and performance portability. This paper explores the performance of OpenCL code compiled for FPGAs for different coding techniques. We evaluate the use of task-kernels versus NDRange kernels, data vectorization, the use of on-chip local memories, and data transfer optimizations by exploiting burst access inference. We present this exploration via a case study of the k-means algorithm, and produce a total of 10 OpenCL implementations of the kernel. To determine the effects of different data set characteristics, and to determine the gains from specialization based on number of attributes, we generated a total of 12 integer data sets. The data sets vary regarding the number of instances, number of attributes (i.e., features), and number of clusters. We also vary the number of processing cores, and present the resulting required resources and operating frequencies. Finally, we execute the same OpenCL code on a 4 GHz Intel i7-6700K CPU, showing that the FPGA achieves speedups up to $1.54\times$ for four cases, and energy savings up to 80% in all cases.

Highlights

  • Unlike devices such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), the reconfigurability of Field Programmable Gate Arrays (FPGAs) allows for very finely-tuned and application-specific implementations of circuits

  • We evaluate the performance of OpenCL code on FPGA resulting from applying multiple coding techniques, including the use of single-task kernels versus NDRange kernels, combined with data vectorization and the use of local memories and burst accesses to local memory

  • Data is exchanged between the CPU and the FPGA via the system memory, by resorting to traditional OpenCL API calls

Summary

INTRODUCTION

Unlike devices such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), the reconfigurability of Field Programmable Gate Arrays (FPGAs) allows for very finely-tuned and application-specific implementations of circuits. Despite the benefits of circuit specialization, the respective lack of programmability implies the same design effort for future revisions. In order to make these devices more suited for general use, over a decade of development has focused on efficient generation of circuits via High Level Synthesis (HLS) of source code such as (subsets of) C/C++ or MATLAB [4], [5]. We evaluate the performance of OpenCL code on FPGAs resulting from applying multiple coding techniques, including the use of single-task kernels versus NDRange kernels, combined with data vectorization and the use of local memories and burst accesses to local memory. We study these aspects via the popular k-means algorithm [15].
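To make the single-task versus NDRange distinction concrete, the following is a minimal sketch of the k-means assignment step in plain C. It is not the paper's code: the function names, argument layout (row-major integer arrays), and the squared Euclidean distance are assumptions chosen for illustration.

```c
#include <limits.h>

/* Hypothetical helper (not from the paper): returns the index of the
 * centroid closest to one instance, using squared Euclidean distance
 * over integer attributes. Centroids are assumed stored row-major. */
static int nearest_centroid(const int *instance, const int *centroids,
                            int num_clusters, int num_attributes)
{
    int best = 0;
    long long best_dist = LLONG_MAX;

    for (int c = 0; c < num_clusters; c++) {
        long long dist = 0;
        for (int a = 0; a < num_attributes; a++) {
            long long d = (long long)instance[a]
                        - (long long)centroids[c * num_attributes + a];
            dist += d * d;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = c;
        }
    }
    return best;
}

/* Single-task style: one invocation iterates over every instance.
 * In an NDRange-style OpenCL kernel the outer loop would disappear and
 * the index i would instead be obtained per work-item, e.g. from
 * get_global_id(0). */
void assign_clusters(const int *data, const int *centroids, int *membership,
                     int num_instances, int num_clusters, int num_attributes)
{
    for (int i = 0; i < num_instances; i++) {
        membership[i] = nearest_centroid(&data[i * num_attributes], centroids,
                                         num_clusters, num_attributes);
    }
}
```

In this sketch the loop body is the same in both styles; what changes is whether the iteration over instances is expressed as an explicit loop (single-task) or left to the runtime's index space (NDRange).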

RELATED WORK
IMPLEMENTED CODE VERSIONS
FPGA VS CPU
COMPARISON TO C IMPLEMENTATION
DISCUSSION AND OBSERVATIONS
Result
CONCLUSION