Abstract
Whilst FPGAs have been integrated into cloud ecosystems, the strict constraints on mapping hardware to a spatially diverse distribution of heterogeneous resources at run-time make their utilization for shared multi-tasking challenging. This work analyzes the effects of such constraints on the achievable compute density, i.e. the efficiency with which the available compute resources are utilized. We hypothesize that static off-line partitioning and mapping of heterogeneous tasks can improve space sharing on the FPGA. The proposed approach allows the FPGA resource to be treated as a service from a higher level and supports multi-task processing without the need for low-level infrastructure support. To evaluate the effects of the existing constraints on our hypothesis, we implement a comprehensive suite of ten real high-performance computing tasks and produce multiple bitstreams per task for a fair evaluation of the various schemes. We then compare our proposed partitioning scheme to previous work in terms of achieved system throughput. Simulation results for large queues of mixed-intensity (compute- and memory-bound) tasks show that the proposed approach can provide more than \(3{\times }\) system speedup. Execution on the Nallatech 385 FPGA card for selected cases suggests that our approach can provide, on average, \(2.9{\times }\) and \(2.3{\times }\) higher system throughput for compute- and mixed-intensity tasks, respectively, while being \(0.2{\times }\) lower for memory-intensive tasks.
Highlights
We evaluate an alternative approach to partially reconfigurable regions (PRRs) by hypothesizing that a higher compute density can be achieved via static partitioning and mapping (SPM) of heterogeneous bitstreams
In addition to OpenCL, we use general high-level synthesis parameters to scale a task over multiple parallel compute units (CUs); multiple pipelines can be defined via a Single Instruction Multiple Data (SIMD) parameter, whilst the key compute-intensive loops can be unrolled via the UNROLL (U) parameter (see the sketch after these highlights)
The maximum throughput is defined by the largest bitstream that fits within the available Field Programmable Gate Array (FPGA) resources
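As a concrete illustration of these scaling knobs, the sketch below shows a hypothetical OpenCL kernel (not one of the paper's ten tasks) annotated with the Intel/Altera FPGA SDK for OpenCL attributes that correspond to the CU, SIMD and UNROLL parameters; the kernel body, the values 2 and 8, and the work-group size of 256 are illustrative assumptions only.

/* Hedged sketch: hypothetical kernel showing the CU, SIMD and UNROLL knobs.
 * The attribute values are illustrative, not the configurations used in the paper. */
__attribute__((num_compute_units(2)))            /* CU: replicate the whole kernel pipeline      */
__attribute__((num_simd_work_items(8)))          /* SIMD: vectorize work-items within a group    */
__attribute__((reqd_work_group_size(256, 1, 1))) /* required when a SIMD width is specified      */
__kernel void poly_eval(__global const float *restrict x,
                        __global const float *restrict coeff, /* 8 coefficients */
                        __global float *restrict y)
{
    int gid = get_global_id(0);
    float v = x[gid];
    float acc = 0.0f;
    #pragma unroll                               /* U: fully unroll the fixed-trip-count loop    */
    for (int k = 0; k < 8; k++)
        acc = acc * v + coeff[k];                /* Horner evaluation of a degree-7 polynomial   */
    y[gid] = acc;
}

Sweeping such CU/SIMD/U values is presumably how the multiple bitstreams per task mentioned in the abstract are generated, with each variant trading resource usage against throughput.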
Summary
Cloud computing offers users ubiquitous access to a shared pool of resources through centralized data centres. With increasing device sizes and efficiency for high-performance computing, there has been growing interest in integrating Field Programmable Gate Arrays (FPGAs) into data centres [5][11]. Their architecture and programming environment present a different resource-sharing model compared to software-programmable accelerators. Heterogeneous tasks in our context are defined by heterogeneity in resource utilization (compute, memory, logic) and execution time. The FPGA is partitioned into rectangular partially reconfigurable regions (PRRs), each of which is typically configured with a new bitstream via dynamic partial reconfiguration (DPR), independently of the processing going on in the other PRRs [17]. This provides independence in time to each PRR, so that a task A running in a PRR can be replaced by a task B as soon as task A finishes. The FPGA is also divided into multiple clock regions across both the vertical and horizontal axes, and crossing a region boundary requires custom logic implementation.
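To make the PRR time-sharing model above concrete, the following minimal C sketch (our illustration, not the paper's simulator) models a queue of tasks greedily assigned to whichever PRR frees up first, with a fixed assumed DPR reconfiguration overhead; the region count, task times and overhead value are purely illustrative.

/* Hedged sketch of PRR-style time sharing: each region runs one task at a time
 * and is reconfigured (fixed assumed DPR overhead) before the next task starts. */
#include <stdio.h>

#define NUM_PRR   4        /* assumed number of partially reconfigurable regions */
#define NUM_TASKS 10       /* length of the task queue                           */
#define DPR_MS    80.0     /* assumed per-reconfiguration overhead (ms)          */

int main(void)
{
    /* illustrative per-task execution times in ms */
    double exec_ms[NUM_TASKS] = {120, 340, 90, 210, 450, 60, 300, 150, 220, 180};
    double prr_free_at[NUM_PRR] = {0}; /* time at which each region becomes free  */
    double makespan = 0.0;

    for (int t = 0; t < NUM_TASKS; t++) {
        /* pick the region that frees up earliest (greedy assignment) */
        int best = 0;
        for (int r = 1; r < NUM_PRR; r++)
            if (prr_free_at[r] < prr_free_at[best])
                best = r;

        /* reconfigure the region via DPR, then run the task */
        prr_free_at[best] += DPR_MS + exec_ms[t];
        if (prr_free_at[best] > makespan)
            makespan = prr_free_at[best];
    }

    printf("Queue of %d tasks on %d PRRs finishes at %.1f ms\n",
           NUM_TASKS, NUM_PRR, makespan);
    return 0;
}

Under this baseline model, every task switch pays the reconfiguration overhead at run-time; the static partitioning and mapping approach evaluated in this work instead fixes the partitioning off-line.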