Deferring accelerator offloading decisions to application runtime

Gavin Vaz, Christian Plessl, Heinrich Riebler, Tobias Kenter

doi:10.1109/reconfig.2014.7032509

Abstract

Reconfigurable architectures provide an opportunity to accelerate a wide range of applications, frequently by exploiting data parallelism, where the same operations are applied uniformly to a (large) set of data. However, when the sequential code is executed on a host CPU and only data-parallel loops are executed on an FPGA coprocessor, a sufficiently large number of loop iterations (trip count) is required so that the control- and data-transfer overheads to the coprocessor can be amortized. The trip count of large data-parallel loops is frequently not known at compile time, but only at runtime just before entering the loop. Therefore, we propose to generate code for both the CPU and the coprocessor, and to defer the decision of where to execute a loop to application runtime, when its trip count can be determined. We demonstrate how an LLVM-based compiler toolflow can automatically insert appropriate decision blocks into the application code. Analyzing popular benchmark suites, we show that this kind of runtime decision is often applicable. The practical feasibility of our approach is demonstrated by a toolflow that automatically identifies loops suitable for vectorization and generates code for the FPGA coprocessor of a Convey HC-1. The toolflow adds decisions that compare the runtime-computed trip count of a specific loop to a threshold for that loop, and it also includes support for moving just the required data to the coprocessor. We evaluate the integrated toolflow with characteristic loops executed on different input data sizes.
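To illustrate the idea of a runtime decision block, the following is a minimal C sketch of what such a guard around a data-parallel loop might look like at the source level. It is not the toolflow's generated code: the names `choose_target`, `scale_add_cpu`, `scale_add_coproc_stub`, and the threshold value `OFFLOAD_THRESHOLD` are hypothetical, and the coprocessor path is a host-side stand-in rather than a call into any Convey HC-1 interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-loop threshold (in iterations). In practice such a value
 * would have to be determined for each loop and platform; 4096 is only a
 * placeholder for illustration. */
#define OFFLOAD_THRESHOLD 4096

typedef enum { TARGET_CPU, TARGET_COPROCESSOR } target_t;

/* Decision block: compare the trip count, known only at runtime,
 * against the loop's threshold. */
static target_t choose_target(size_t trip_count, size_t threshold)
{
    return trip_count >= threshold ? TARGET_COPROCESSOR : TARGET_CPU;
}

/* CPU version of the data-parallel loop: y[i] = a * x[i] + y[i]. */
static void scale_add_cpu(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Stand-in for the coprocessor version (assumption: a real implementation
 * would copy only x[0..n) and y[0..n) to coprocessor memory, start the
 * kernel, and copy y back). Here it simply runs on the host so that the
 * sketch stays self-contained. */
static void scale_add_coproc_stub(float a, const float *x, float *y, size_t n)
{
    scale_add_cpu(a, x, y, n);
}

/* Both code versions exist in the binary; the choice between them is
 * deferred until the trip count n is available. Returns the chosen target
 * so that callers can observe the decision. */
target_t scale_add(float a, const float *x, float *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));

    target_t target = choose_target(n, OFFLOAD_THRESHOLD);
    if (target == TARGET_COPROCESSOR)
        scale_add_coproc_stub(a, x, y, n);
    else
        scale_add_cpu(a, x, y, n);
    return target;
}
```

In this sketch, a call with a small `n` stays on the CPU because the transfer overhead would not be amortized, while a call with `n` at or above the threshold takes the coprocessor path. Both paths compute the same result, so the decision affects only where the loop runs.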

Full Text