Abstract

The traditional deep deterministic policy gradient (DDPG) algorithm suffers from slow convergence and a tendency to fall into local optima. To address these two issues, this paper proposes a DDPG algorithm based on a double network prioritized experience replay mechanism (DNPER-DDPG). First, the value function is approximated with two critic networks, and the minimum of the two action-value estimates is used to update the actor policy network, which reduces the likelihood of converging to a locally optimal policy. Second, the Q values produced by the two networks and the immediate reward returned by the environment serve as the criteria for prioritization, ranking the importance of samples in the experience replay mechanism and thereby improving the convergence speed of the algorithm. Finally, the improved method is evaluated on classic control environments from OpenAI Gym, and the results show that it achieves faster convergence and higher cumulative reward than the comparison algorithms.
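The two core ideas in the abstract can be illustrated with a short sketch. The snippet below (PyTorch) is a minimal, hedged rendering of the twin-critic idea: two critics estimate Q(s, a), and the smaller of the two estimates is used both for the bootstrapped target and for the actor update. The network architecture, layer sizes, and function names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the twin-critic idea: the smaller of two Q estimates
# drives both the critic target and the actor's objective.
# Architectures and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def critic_target(critic1_t, critic2_t, actor_t, next_s, r, done, gamma=0.99):
    """Bootstrapped target using the minimum of the two target critics."""
    with torch.no_grad():
        next_a = actor_t(next_s)
        q_min = torch.min(critic1_t(next_s, next_a), critic2_t(next_s, next_a))
        return r + gamma * (1.0 - done) * q_min

def actor_loss(critic1, critic2, actor, s):
    """The actor ascends the smaller (more pessimistic) of the two Q estimates."""
    a = actor(s)
    return -torch.min(critic1(s, a), critic2(s, a)).mean()
```

Using the pessimistic (minimum) estimate counteracts the overestimation bias of a single critic, which is what the abstract credits with reducing convergence to locally optimal policies.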

Highlights

  • With the development of artificial intelligence, reinforcement learning has achieved a number of results in discrete action spaces [1]–[3]

  • A deep deterministic policy gradient algorithm based on a double network prioritized experience replay (ER) mechanism is proposed in this paper

  • To reduce the probability of getting stuck in a local optimum, two critic networks are introduced into the structure of the algorithm

Summary

INTRODUCTION

With the development of artificial intelligence, reinforcement learning has achieved a number of results in discrete action spaces [1]–[3]. The stochastic weight averaging method was introduced to reduce the influence of noise in the gradient estimator during training; it was tested on continuous action space tasks in Atari and MuJoCo, and the stability of the training process was thereby increased. A parallel actor network was introduced to speed up training, and prioritized experience replay was introduced to improve sample utilization. A reinforcement learning method combining prioritized experience replay and DDPG was proposed in [13]. This paper introduces the basic principle of DDPG and elaborates its network structure and important parameters in detail to identify its deficiencies in handling continuous action space tasks. To improve the convergence of the algorithm, a priority function for samples in the experience replay mechanism is proposed.
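As a rough illustration of how such a priority function could be wired into an experience replay buffer, the sketch below (NumPy) assigns each transition a priority that grows with the immediate reward and with the disagreement between the two critics' Q estimates. The exact priority formula, the buffer class, and all parameter names here are assumptions for illustration only; the paper defines its own priority function.

```python
# Illustrative prioritized replay buffer; the priority rule is an assumption,
# not the paper's exact formula.
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity: int, alpha: float = 0.6):
        self.capacity = capacity
        self.alpha = alpha                      # how strongly priority skews sampling
        self.storage = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition, q1: float, q2: float, reward: float):
        # Assumed priority: |reward| plus the disagreement between the two critics.
        priority = abs(reward) + abs(q1 - q2) + 1e-6
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = priority ** self.alpha
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size: int):
        # Sample transitions with probability proportional to their priority.
        p = self.priorities[:len(self.storage)]
        probs = p / p.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        return [self.storage[i] for i in idx], idx
```

Transitions with large rewards or with Q estimates the two critics disagree on are replayed more often, which is the mechanism the abstract credits with speeding up convergence.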

DEEP DETERMINISTIC POLICY GRADIENT
PENDULUM CONTROL BASED ON IMPROVED ALGORITHM
Findings
CONCLUSION
