Abstract

In this paper we propose an efficient hardware architecture that implements the Q-Learning algorithm, suitable for real-time applications. Its main features are low power consumption, high throughput and limited hardware resource usage. We also propose a technique based on approximated multipliers to reduce the hardware complexity of the algorithm. We implemented the design on a Xilinx Zynq Ultrascale+ MPSoC ZCU106 Evaluation Kit. The implementation results are evaluated in terms of hardware resources, throughput and power consumption. The architecture is compared with state-of-the-art Q-Learning hardware accelerators presented in the literature, obtaining better results in terms of speed, power and hardware resources. Experiments using different sizes for the Q-Matrix and different wordlengths for the fixed-point arithmetic are presented. With a Q-Matrix of size $8\times4$ (8 bit data) we achieved a throughput of 222 MSPS (Mega Samples Per Second) and a dynamic power consumption of 37 mW, while with a Q-Matrix of size $256\times16$ (32 bit data) we achieved a throughput of 93 MSPS and a power consumption of 611 mW. Due to the small amount of hardware resources required by the accelerator, our system is suitable for multi-agent IoT applications. Moreover, the architecture can be used to implement the SARSA (State-Action-Reward-State-Action) Reinforcement Learning algorithm with minor modifications.
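The abstract mentions a technique based on approximated multipliers to reduce hardware complexity, but this summary does not specify the multiplier design. As a rough, non-authoritative software illustration of the general idea, the sketch below shows a simple truncation-based approximate multiply for Q16.16 fixed-point values; the name `approx_mul_q16`, the `TRUNC_BITS` parameter and the truncation scheme itself are assumptions for illustration only, not the paper's actual circuit.

```c
#include <stdint.h>

/* Generic truncation-based approximate multiply (illustrative only):
 * drop the TRUNC_BITS least-significant bits of each operand before
 * multiplying, trading accuracy for a smaller partial-product array.
 * Operands and result are in Q16.16 fixed-point format.            */
#define TRUNC_BITS 4

static inline int32_t approx_mul_q16(int32_t a, int32_t b)
{
    int32_t at = a >> TRUNC_BITS;            /* truncated operands           */
    int32_t bt = b >> TRUNC_BITS;
    int64_t p  = (int64_t)at * (int64_t)bt;  /* reduced-width product        */

    /* Exact Q16.16 multiply would be (a*b) >> 16; the two truncations
     * already removed 2*TRUNC_BITS bits, so shift by the remainder.  */
    return (int32_t)(p >> (16 - 2 * TRUNC_BITS));
}
```

The approximation error grows with TRUNC_BITS, so in a real design this parameter would be chosen against the wordlength and the accuracy the learning task can tolerate.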

Highlights

  • Reinforcement Learning (RL) is a Machine Learning (ML) approach used to train an entity, called agent, to accomplish a certain task [1]

  • The performance of software-based implementations is the main limitation to further development of such systems, and hardware accelerators based on FPGAs or ASICs can represent an efficient solution for implementing RL algorithms

  • In 2017, Su et al. [24] proposed another Deep Q-Learning hardware implementation based on an Intel Arria-10 FPGA


Summary

INTRODUCTION

Reinforcement Learning (RL) is a Machine Learning (ML) approach used to train an entity, called the agent, to accomplish a certain task [1]. The reward (or reinforcement) is a quality figure for the last action performed by the agent and is represented as a positive or negative number. Through this iterative process, the agent learns an optimal action-selection policy to accomplish its task. These kinds of applications require powerful computing platforms able to process very large amounts of data as fast as possible and with limited power consumption. For these reasons, the performance of software-based implementations is the main limitation to further development of such systems, and hardware accelerators based on FPGAs or ASICs can represent an efficient solution for implementing RL algorithms. Q-Learning stores the learned values in the Q-Matrix. The size of this matrix is N × Z, where N is the number of possible states in which the agent can sense the environment and Z is the number of possible actions that the agent can perform. This means that Q-Learning operates in a discrete state-action space S × A. In [16] it is proved that knowledge of the Q-Matrix suffices to extract the optimal action-selection policy for an RL agent.
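For reference, the standard tabular Q-Learning update over an N × Z Q-Matrix can be sketched in software as follows. This is a minimal sketch: the matrix sizes, the identifiers (`q_update`, `N_STATES`, `N_ACTIONS`) and the use of floating-point arithmetic are illustrative assumptions; the paper's accelerator realizes this update in fixed-point hardware.

```c
#include <stddef.h>

#define N_STATES  8   /* N: number of states the agent can be in (example size)  */
#define N_ACTIONS 4   /* Z: number of actions available to the agent             */

static float Q[N_STATES][N_ACTIONS];   /* the Q-Matrix, Q(s, a) */

/* One Q-Learning update step:
 *   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
 * s  : current state, a : action taken, r : reward received,
 * s2 : next state observed after taking action a.                   */
void q_update(size_t s, size_t a, float r, size_t s2,
              float alpha, float gamma)
{
    /* max over the next state's row of the Q-Matrix */
    float max_next = Q[s2][0];
    for (size_t k = 1; k < N_ACTIONS; ++k)
        if (Q[s2][k] > max_next)
            max_next = Q[s2][k];

    Q[s][a] += alpha * (r + gamma * max_next - Q[s][a]);
}
```

The inner loop corresponds to the max-selection stage (the MAX BLOCK discussed later), while the final line corresponds to the update datapath, whose multiplications are the natural target for the approximated multipliers mentioned in the abstract.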

RELATED WORK
PROPOSED ARCHITECTURE
MAX BLOCK
CONCLUSION
