Deep reinforcement learning: Algorithm, applications, and ultra-low-power implementation

Hongjia Li, Ruizhe Cai, Ning Liu, Xue Lin, Yanzhi Wang

doi:10.1016/j.nancom.2018.02.003

Abstract

Traditional reinforcement learning techniques are limited to state and action spaces of restricted dimensionality. The recent breakthroughs of deep reinforcement learning (DRL) in AlphaGo and in playing Atari games set a good example of handling the large state and action spaces of complicated control problems. The DRL technique comprises an offline deep neural network (DNN) construction phase and an online deep Q-learning phase. In the offline phase, DNNs are used to derive the correlation between each state–action pair of the system and its value function. In the online phase, a deep Q-learning technique based on the offline-trained DNN derives the optimal action while updating the value estimates and the DNN.

This paper is the first to provide a comprehensive study of applications of the DRL framework to cloud computing and residential smart grid systems, along with efficient hardware implementations. Building on the general DRL framework, we develop two applications: one for the cloud computing resource allocation problem and one for the residential smart grid user-end task scheduling problem. The former achieves up to 54.1% energy saving compared with baselines by automatically and dynamically distributing resources to servers. The latter achieves up to 22.77% total energy cost reduction compared with the baseline algorithm.

The DRL framework is mainly used for complicated control problems and requires lightweight, low-power implementations in edge and portable systems. To this end, we develop an ultra-low-power implementation of the DRL framework using the stochastic computing technique, which has the potential to significantly enhance computation speed and reduce hardware footprint, and therefore power/energy consumption. The overall implementation is based on effective stochastic computing-based implementations of approximate parallel counter-based inner product blocks and tanh activation functions. The stochastic computing-based implementation achieves only 57941.61 μm² area and 6.30 mW power with 412.47 ns delay.
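
To make the stochastic computing idea concrete, the following is a minimal software sketch of an inner product computed on bit streams. It assumes the common bipolar encoding, in which a value x in [-1, 1] is represented by a stream whose bits are 1 with probability (x + 1) / 2, so that a bitwise XNOR of two independent streams encodes their product. The function names, the stream length, and the exact counting of ones are illustrative assumptions; this is not the approximate parallel counter hardware design evaluated in the paper.

```python
import random


def to_bipolar_stream(x, length, rng):
    """Encode x in [-1, 1] as a bit stream with P(bit = 1) = (x + 1) / 2."""
    p_one = (x + 1.0) / 2.0
    return [1 if rng.random() < p_one else 0 for _ in range(length)]


def from_bipolar_stream(bits):
    """Decode a bipolar stream back to a value in [-1, 1]."""
    return 2.0 * sum(bits) / len(bits) - 1.0


def sc_multiply(bits_a, bits_b):
    """Multiply two bipolar streams with a bitwise XNOR."""
    return [1 - (a ^ b) for a, b in zip(bits_a, bits_b)]


def sc_inner_product(xs, ws, length=4096, seed=0):
    """Estimate sum(x * w) by counting ones across all product streams.

    The counting here is exact (plain software addition); a hardware
    approximate parallel counter would trade some accuracy for area.
    """
    rng = random.Random(seed)
    ones = 0
    for x, w in zip(xs, ws):
        stream_x = to_bipolar_stream(x, length, rng)
        stream_w = to_bipolar_stream(w, length, rng)
        ones += sum(sc_multiply(stream_x, stream_w))
    # Each product stream decodes as 2 * ones_i / length - 1; sum over streams.
    return 2.0 * ones / length - len(xs)


if __name__ == "__main__":
    inputs = [0.5, -0.25, 0.8]
    weights = [0.4, 0.6, -0.5]
    exact = sum(x * w for x, w in zip(inputs, weights))
    print("exact:", exact)
    print("stochastic estimate:", sc_inner_product(inputs, weights))
```

The estimate is noisy, and its error shrinks as the stream length grows. In this encoding, longer streams buy accuracy at the cost of more clock cycles, while the per-bit logic stays as small as a single gate.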

Full Text