490 publications found
Sort by
FDRA: A Framework for Dynamically Reconfigurable Accelerator Supporting Multi-Level Parallelism

Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open-source works often lack integration of CGRAs with CPU systems and corresponding toolchains. Moreover, there is rare support for the accelerator instruction pipelining to overlap data communication, computation, and configuration across multiple tasks. In this paper, we propose FDRA, an open-source exploration framework for a heterogeneous system-on-chip (SoC) with a RISC-V processor and a dynamically reconfigurable accelerator (DRA) supporting loop, instruction, and task levels of parallelism. FDRA encompasses parameterized SoC modeling, Verilog generation, source-to-source application code transformation using frontend and DRA compilers, SoC simulation, and FPGA prototyping. FDRA incorporates the extraction of periodic accumulative operators and multidimensional linear load/store operators from nested loops. The DRA enables accessing the shared L2 cache with virtual addresses and supports direct memory access (DMA) with arbitrary start addresses and data lengths. Integrated into the RISC-V Rocket SoC, our DRA achieves a remarkable 55 × acceleration for loop kernels and improves energy efficiency by 29 ×. Compared to state-of-the-art RISC-V vector units, our DRA demonstrates a 2.9 × speed improvement and 3.5 × greater energy efficiency. In contrast to previous CGRA+RISC-V SoCs, our SoC achieves a minimum speedup of 5.2 ×.

Open Access
Relevant
Designing Deep Learning Models on FPGA with Multiple Heterogeneous Engines

Deep learning models are becoming more complex and heterogeneous with new layer types to improve their accuracy. This brings a considerable challenge to the designers of accelerators of deep neural networks. There have been several architectures and design flows to map deep learning models on hardware, but they are limited to a particular model and/or layer types. Also, the architectures generated by these tools target, in general, high-performance devices, not appropriate for embedded computing. This paper proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to any density FPGAs. The architecture and design flow were applied to the development of a hardware/software system for image classification with ResNet50, object detection with YOLOv3-Tiny and image segmentation with Deeplabv3+. The system was tested in a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators. An accelerator of ResNet50 with a 4-bit quantization achieves 67 FPS, and the object detector with YOLOv3-Tiny with a throughput of 36 FPS and the image segmentation application achieves 1.4 FPS.

Open Access
Relevant
<scp>Strega</scp> : An HTTP Server for FPGAs

The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present Strega , an open-source 1 light-weight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7 M HTTP requests per second with an end-to-end latency as low as 16 μ s, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.

Open Access
Relevant
High-Efficiency TRNG Design based on Multi-bit Dual-ring Oscillator

Unpredictable true random numbers are required in security technology fields such as information encryption, key generation, mask generation for anti-side-channel analysis, algorithm initialization, etc. At present, the true random number generator (TRNG) is not enough to provide fast random bits by low-speed bits generation. Therefore, it is necessary to design a faster TRNG. This work presents an ultra-compact TRNG with high throughput based on a novel extendable dual-ring oscillator (DRO). Owing to multiple bits output per cycle in DRO can be used to obtain the original random sequence, the proposed DRO achieves a maximum resource utilization to build a more efficient TRNG, compared with the conventional TRNG system based on ring oscillator (RO), which only has a single output and needs to build multiple groups of ring oscillators. TRNG based on the 2-bit DRO and its 8-bit derivative structure has been verified on Xilinx Artix-7 and Kintex-7 FPGA under the automatic layout and routing and has achieved a throughput of 550Mbps and 1100Mbps respectively. Moreover, in terms of throughput performance over operating frequency, hardware consumption, and entropy, the proposed scheme has obvious advantages. Finally, the generated sequences show good randomness in the test of NIST SP800-22 and Dieharder test suite and pass the entropy estimation test kit NIST SP800-90B and AIS-31.

Open Access
Relevant