Abstract

Although the combination of adaptive gradient methods and deep reinforcement learning has achieved tremendous success in practical applications, its theoretical convergence properties are not well understood. To address this issue, we propose a neural network-based adaptive TD algorithm, called NTD-AMSGrad, which is a variant of temporal difference learning. Moreover, we rigorously analyze the convergence of the proposed algorithm and establish a finite-time bound for NTD-AMSGrad under the Markov observation model. Specifically, when the neural network is sufficiently wide, the proposed algorithm converges to the optimal action-value function, with a finite-time error bound that decays in the number of iterations $T$.
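The abstract names the two ingredients of NTD-AMSGrad: a semi-gradient TD update and AMSGrad-style adaptive step sizes. The sketch below illustrates how these combine in one parameter update. It is a minimal illustration only, not the paper's method: the paper uses a neural network Q-function, whereas this sketch uses a linear approximation $Q(s,a) = w^\top \phi(s,a)$, and all hyperparameters, the function name `amsgrad_td_step`, and the random toy data are assumptions.

```python
import numpy as np

def amsgrad_td_step(w, m, v, v_hat, phi_sa, phi_next, r, gamma=0.99,
                    alpha=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One semi-gradient TD(0) step with AMSGrad moment estimates (illustrative sketch)."""
    # TD error for the transition (s, a) -> (s', a') with reward r.
    td_error = r + gamma * np.dot(w, phi_next) - np.dot(w, phi_sa)
    # Semi-gradient of 0.5 * td_error^2 w.r.t. w (next-state value treated as constant).
    g = -td_error * phi_sa
    # AMSGrad moment updates: first moment, second moment, and the
    # non-decreasing maximum of second moments that distinguishes AMSGrad from Adam.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = np.maximum(v_hat, v)
    w = w - alpha * m / (np.sqrt(v_hat) + eps)
    return w, m, v, v_hat, td_error

# Toy usage on random features (assumed data, purely for illustration).
rng = np.random.default_rng(0)
d = 4
w = np.zeros(d); m = np.zeros(d); v = np.zeros(d); v_hat = np.zeros(d)
for _ in range(200):
    phi_sa, phi_next = rng.normal(size=d), rng.normal(size=d)
    r = rng.normal()
    w, m, v, v_hat, delta = amsgrad_td_step(w, m, v, v_hat, phi_sa, phi_next, r)
```

The `np.maximum(v_hat, v)` line is the AMSGrad modification: keeping a running maximum of the second-moment estimate yields non-increasing effective step sizes, which is what makes the finite-time analysis of such adaptive TD schemes tractable.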
