Abstract

The stochastic dynamics of reinforcement learning is studied using a master equation formalism. We consider two different problems: Q-learning for a two-agent game and the multi-armed bandit problem with policy gradient as the learning method. The master equation is constructed by introducing a probability distribution over continuous policy parameters or over both continuous policy parameters and discrete state variables (a more advanced case). We use a version of the moment closure approximation to solve for the stochastic dynamics of the models. Our method gives accurate estimates for the mean and the (co)variance of policy variables. For the case of the two-agent game, we find that the variance terms are finite at steady state and derive a system of algebraic equations for computing them directly.
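To make the object of study concrete, the following is a minimal sketch (not the paper's method) of the second setting the abstract mentions: REINFORCE-style policy gradient on a two-armed Bernoulli bandit. Running many independent trials and measuring the empirical mean and variance of the policy parameter gives, by Monte Carlo, the same moments that the master-equation and moment-closure treatment estimates analytically. All names, step counts, and reward probabilities here are illustrative assumptions.

```python
import math
import random

def run_trial(steps=500, lr=0.1, probs=(0.3, 0.7), seed=0):
    """One run of policy gradient on a two-armed Bernoulli bandit.

    A single parameter theta defines the policy P(arm 1) = sigmoid(theta).
    Returns the final value of theta after `steps` updates.
    """
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))        # probability of arm 1
        arm = 1 if rng.random() < p1 else 0
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        # REINFORCE update for a sigmoid policy:
        # d/dtheta log pi(arm) = arm - p1
        theta += lr * reward * (arm - p1)
    return theta

# Empirical mean and variance of theta across independent runs --
# the moments a moment-closure approximation would estimate directly.
finals = [run_trial(seed=s) for s in range(200)]
mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
print(f"mean(theta) = {mean:.3f}, var(theta) = {var:.3f}")
```

Because each trial is an independent stochastic trajectory, the spread of `finals` illustrates why the policy parameters must be described by a probability distribution rather than a single deterministic value.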
