Abstract

The target network has been a foundation of deep reinforcement learning since DeepMind first proposed it in 2015, and almost all popular reinforcement learning algorithms include one. However, while the slowly updated target network improves the stability of an algorithm, it also reduces its performance. In this paper, the authors design a novel triple-network algorithm (TPN). TPN combines the temporal-difference (TD) algorithm with the policy gradient (PG) theorem, using three networks to estimate the state value (v), the action value (q), and the policy (π). These networks have no primary or secondary distinction; they are trained synchronously and influence each other. The authors found that this TPN architecture greatly improves the convergence and stability of the algorithm without increasing the amount of computation. Although TPN is only a basic framework at present, its calculation process is simple and easy to implement. Experiments show that the convergence speed and stability of TPN in discrete settings are better than those of PPO.
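The abstract only names the three estimators and the TD/PG combination; as a rough illustration, the sketch below shows one way such a triple-network update could be wired up in PyTorch, assuming one-step TD targets for both v and q and a (q − v)-weighted policy-gradient loss for π. The class names, hyperparameters, and loss choices here are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a triple-network (v, q, pi) update step.
# Assumptions (not from the paper): PyTorch, discrete actions, one-step TD
# targets for both V and Q, and an advantage (q - v) weighted policy-gradient
# loss. Network sizes, learning rate, and gamma are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleNet(nn.Module):
    """Three separate estimators for V(s), Q(s, a), and pi(a|s);
    none is a slowly updated copy of another."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.q = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions))
        self.pi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions))

def update(net, opt, batch, gamma=0.99):
    """One synchronous update of all three estimators on a batch of
    (s, a, r, s2, done) transitions (a is a LongTensor of action indices)."""
    s, a, r, s2, done = batch
    v = net.v(s).squeeze(-1)                                      # V(s)
    q = net.q(s).gather(1, a.unsqueeze(1)).squeeze(1)             # Q(s, a)
    logp = F.log_softmax(net.pi(s), dim=1).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        v_next = net.v(s2).squeeze(-1)                            # bootstrap from the live v network
        td_target = r + gamma * (1.0 - done) * v_next             # one-step TD target
        adv = q - v                                               # advantage estimate from q and v

    loss_v = F.mse_loss(v, td_target)                             # TD error for v
    loss_q = F.mse_loss(q, td_target)                             # TD error for q
    loss_pi = -(adv * logp).mean()                                # policy-gradient term for pi

    opt.zero_grad()
    (loss_v + loss_q + loss_pi).backward()
    opt.step()
```

Note that in this sketch the bootstrap value for the TD target comes from the live v network itself rather than from a slowly updated copy, mirroring the paper's stated goal of removing the target network while training the three estimators synchronously.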
