Abstract

In this work, we present the design and implementation of an ultra-low-latency FPGA-based Deep Reinforcement Learning (DRL) accelerator for addressing hard real-time Mixed Integer Programming problems. The accelerator achieves ultra-low latency for both training and inference, enabled by training-inference parallelism, pipelined training, on-chip weight and replay memories, multi-level replication-based parallelism, and DRL algorithmic modifications such as the distribution of training over time. The design principles can be extended to hardware acceleration of other relevant DRL algorithms that embed the experience replay technique and face hard real-time constraints. We evaluate the accuracy of the accelerator on a task offloading and resource allocation problem stemming from a Mobile Edge Computing (MEC/5G) scenario. The design has been implemented on a Xilinx Zynq UltraScale+ MPSoC ZCU104 evaluation kit using High-Level Synthesis. The accelerator achieves near-optimal performance and exhibits a 10-fold decrease in training-inference execution latency compared to a high-end CPU-based implementation.
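To make the architectural ideas in the abstract concrete, the following HLS-style C++ sketch illustrates how on-chip weight and replay memories and overlapped inference/training stages could be expressed. It is a minimal sketch under assumed dimensions, function names, and a two-layer Q-network, not the paper's actual design or source code; a synthesizable version would additionally need the shared on-chip arrays partitioned or streamed between the concurrent processes.

```cpp
// Minimal HLS-style C++ sketch of a DRL accelerator top level.
// All sizes, names, and the two-layer Q-network are illustrative assumptions.
#include <cstddef>

constexpr std::size_t STATE_DIM   = 16;   // assumed state vector width
constexpr std::size_t HIDDEN_DIM  = 64;   // assumed hidden layer width
constexpr std::size_t ACTION_DIM  = 8;    // assumed number of discrete actions
constexpr std::size_t REPLAY_SIZE = 1024; // assumed on-chip replay capacity

struct Transition {
    float state[STATE_DIM];
    float next_state[STATE_DIM];
    int   action;
    float reward;
};

// Weights and replay memory kept entirely on chip (BRAM/URAM), so neither
// inference nor training touches off-chip DRAM on the latency-critical path.
static float      w1[HIDDEN_DIM][STATE_DIM];
static float      w2[ACTION_DIM][HIDDEN_DIM];
static Transition replay[REPLAY_SIZE];

// Pipelined forward pass; hidden activations are exposed so the training
// stage can reuse them for its gradient step.
void forward(const float state[STATE_DIM],
             float hidden[HIDDEN_DIM], float q[ACTION_DIM]) {
    for (std::size_t h = 0; h < HIDDEN_DIM; ++h) {
#pragma HLS PIPELINE II=1
        float acc = 0.0f;
        for (std::size_t s = 0; s < STATE_DIM; ++s)
            acc += w1[h][s] * state[s];
        hidden[h] = (acc > 0.0f) ? acc : 0.0f; // ReLU
    }
    for (std::size_t a = 0; a < ACTION_DIM; ++a) {
#pragma HLS PIPELINE II=1
        float acc = 0.0f;
        for (std::size_t h = 0; h < HIDDEN_DIM; ++h)
            acc += w2[a][h] * hidden[h];
        q[a] = acc;
    }
}

// Simplified training stage: a one-sample Q-learning update of the output
// layer, reading the transition directly from the on-chip replay memory.
void train_step(std::size_t idx, float gamma, float lr) {
    const Transition &t = replay[idx];
    float hidden[HIDDEN_DIM], q[ACTION_DIM];
    float hidden_next[HIDDEN_DIM], q_next[ACTION_DIM];
    forward(t.state, hidden, q);
    forward(t.next_state, hidden_next, q_next);

    float max_next = q_next[0];
    for (std::size_t a = 1; a < ACTION_DIM; ++a)
        if (q_next[a] > max_next) max_next = q_next[a];

    const float td_error = t.reward + gamma * max_next - q[t.action];
    for (std::size_t h = 0; h < HIDDEN_DIM; ++h) {
#pragma HLS PIPELINE II=1
        w2[t.action][h] += lr * td_error * hidden[h];
    }
}

// Top level: inference for the current request and one training update run
// as concurrent dataflow processes, sketching training-inference parallelism.
void drl_top(const float state[STATE_DIM], float q[ACTION_DIM],
             std::size_t train_idx, float gamma, float lr) {
#pragma HLS DATAFLOW
    float hidden[HIDDEN_DIM]; // scratch activations for the inference path
    forward(state, hidden, q);
    train_step(train_idx, gamma, lr);
}
```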

Highlights

  • Reinforcement Learning (RL) has been adopted as a solution mechanism for various problems in edge computing, including Mobile Edge Computing (MEC) scenarios ranging from task offloading and resource allocation to routing, caching placement and energy harvesting

  • As demanding (mostly Industrial) Internet of Things (IoT) applications push towards the new generation of edge computing, hard real-time constraints must be met by the algorithms orchestrating the allocation of computational resources; a latency bottleneck in the control operations may result in malfunction of the overall computational network

  • In this paper, we present the design of an FPGA-based accelerator for Deep RL (DRL) with on-chip weight and replay memory and application-specific resource utilization

Summary

INTRODUCTION

Reinforcement Learning (RL) has been adopted as a solution mechanism for various problems in edge computing, including Mobile Edge Computing (MEC) scenarios (supported by the expansion of 5G networking technologies) ranging from task offloading and resource allocation to routing, caching placement and energy harvesting. The main use cases for RL in edge computing can be divided into two large categories, namely NP-hard problems (e.g., Mixed Integer Programming, MIP) and problems with inherent information uncertainty/asymmetry about the underlying network parameters and computational resources. For the former case, Deep RL (DRL) algorithms have shown superior performance compared with state-of-the-art conventional techniques, such as Linear Relaxation [1], resulting in both faster convergence and solutions of higher quality. Environment simulators allow for the utilization of generated inputs and provide action feedback at a much higher rate than the physical instance of the environment. This leads to the availability of a large volume of training samples, which requires acceleration of both training and inference in order to be successfully exploited by the DRL algorithm. Although [7] and [9] focus on both training and inference, the complex structure of the NN they target does not allow real-time operation, since the computational complexity is not consistent with the strict latency constraints of the problems of interest.
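To illustrate why a fast simulator puts both operations on the latency-critical path, the sketch below shows a generic experience-replay interaction loop: every simulator step triggers one inference call and, once the replay memory is warm, a training update as well. The class and function names, sizes, and warm-up threshold are illustrative assumptions, not taken from the paper.

```cpp
// Generic DQN-style interaction loop: the simulator produces transitions much
// faster than a physical deployment, so both act() (inference) and learn()
// (training) must keep up. Names, sizes, and thresholds are assumptions.
#include <array>
#include <cstddef>
#include <random>

constexpr std::size_t STATE_DIM   = 16;   // assumed state vector width
constexpr std::size_t REPLAY_SIZE = 1024; // assumed replay capacity

struct Transition {
    std::array<float, STATE_DIM> state, next_state;
    int   action;
    float reward;
};

// Simple ring-buffer replay memory, as used by experience-replay DRL agents.
class ReplayMemory {
public:
    void push(const Transition &t) {
        buf_[head_] = t;
        head_ = (head_ + 1) % REPLAY_SIZE;
        if (count_ < REPLAY_SIZE) ++count_;
    }
    const Transition &sample(std::mt19937 &rng) const {
        std::uniform_int_distribution<std::size_t> d(0, count_ - 1);
        return buf_[d(rng)];
    }
    std::size_t size() const { return count_; }
private:
    std::array<Transition, REPLAY_SIZE> buf_{};
    std::size_t head_  = 0;
    std::size_t count_ = 0;
};

// Interaction loop skeleton: one inference call per simulator step and, once
// the replay memory is warm, one training update per step as well.
template <typename Env, typename Agent>
void run(Env &env, Agent &agent, ReplayMemory &replay,
         std::mt19937 &rng, std::size_t steps) {
    std::array<float, STATE_DIM> state = env.reset();
    for (std::size_t i = 0; i < steps; ++i) {
        int action = agent.act(state);                 // inference
        auto [next_state, reward] = env.step(action);  // fast simulated feedback
        replay.push({state, next_state, action, reward});
        if (replay.size() >= 64)                       // assumed warm-up threshold
            agent.learn(replay.sample(rng));           // training update
        state = next_state;
    }
}
```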

BACKGROUND
EFFICIENCY DRIVEN ALGORITHMIC MODIFICATIONS
ACCELERATOR ARCHITECTURE
USE CASE AND ACCURACY EVALUATION
IMPLEMENTATION AND LATENCY-ENERGY EVALUATION
CONCLUSION