Abstract

Actor-critic models are generally prone to overestimating Q-values and settling on sub-optimal policies. Our proposed approach builds on a value-based deep reinforcement learning algorithm, the twin delayed deep deterministic policy gradient (TD3). The approach is applied to complex continuous-control reinforcement learning tasks, such as making humanoid, ant, and half-cheetah robots traverse a path. These tasks require an algorithm that works on continuous action spaces without significantly delaying result propagation during model inference. The proposed model has been adapted to converge faster to optimal Q-values. TD3 uses two deep neural networks to learn two Q-values, viz. Q1 and Q2; in the proposed approach, the average of the two Q-values is taken as the input for the final Q-value, unlike other reinforcement learning algorithms such as DDPG, which are prone to overestimating Q-values. The proposed approach also introduces a self-adjusting noise-clipping function, which makes it harder for the policy to exploit Q-function errors, further improving performance.
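To make the modification concrete, the sketch below contrasts the standard TD3 target, which takes the minimum of the two critic estimates, with the averaged target described above. It is a minimal illustration only: the network names (actor_target, critic1_target, critic2_target) and hyperparameter values are our own placeholders, and the fixed noise_clip stands in for the paper's self-adjusting noise-clipping function, which is not reproduced here.

import torch

def td3_target(reward, next_state, done, actor_target, critic1_target,
               critic2_target, gamma=0.99, policy_noise=0.2, noise_clip=0.5,
               use_average=True):
    # Placeholder networks: actor_target maps states to actions in [-1, 1];
    # each critic_target maps (state, action) pairs to scalar Q-values.
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        next_action_mean = actor_target(next_state)
        noise = (torch.randn_like(next_action_mean) * policy_noise
                 ).clamp(-noise_clip, noise_clip)
        next_action = (next_action_mean + noise).clamp(-1.0, 1.0)

        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)

        if use_average:
            # Proposed variant: average the two critic estimates.
            q_next = 0.5 * (q1 + q2)
        else:
            # Standard TD3: take the minimum to curb overestimation.
            q_next = torch.min(q1, q2)

        # Bellman backup; (1 - done) zeroes the bootstrap at terminal states.
        return reward + gamma * (1.0 - done) * q_next

Averaging trades the pessimism of the minimum for a less biased estimate, which is consistent with the abstract's claim of faster convergence to optimal Q-values.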
