Abstract

In this work we consider the problem of policy optimization in the context of reinforcement learning. In order to avoid discretization, we select the optimal policy to be a continuous function belonging to a reproducing kernel Hilbert space (RKHS) which maximizes an expected cumulative reward (ECR). We design a policy gradient algorithm (PGA) in this context, deriving the gradients of the functional ECR and learning the unknown state transition probabilities along the way. In particular, we propose an unbiased stochastic approximation of the gradient that requires only a finite number of steps. This unbiased estimator is the key enabler for a novel stochastic PGA, which provably converges to a critical point of the ECR. However, the RKHS approach increases the model order at every iteration by adding extra kernels, which may render the numerical complexity prohibitive. To overcome this limitation, we prune the kernel dictionary using an orthogonal matching pursuit procedure, and prove that the modified method keeps the model order bounded for all iterations, while ensuring convergence to a neighborhood of the critical point.
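To make the construction concrete, the sketch below (ours, not the authors' code) shows one way an RKHS-parameterized Gaussian policy can be updated and pruned: each stochastic gradient step appends one kernel atom to the dictionary, and a greedy destructive pruning pass, in the spirit of kernel orthogonal matching pursuit, removes atoms that the remaining dictionary can reproduce to within a tolerance. Names such as rbf, bandwidth, step_size, q_estimate, and prune_tol are illustrative assumptions, not quantities taken from the paper.

import numpy as np

def rbf(x, y, bandwidth=1.0):
    # Gaussian (RBF) kernel between two states; bandwidth is an assumed parameter.
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(d, d) / (2.0 * bandwidth ** 2))

class RKHSPolicy:
    # Gaussian policy whose mean is an RKHS function h(s) = sum_i w_i k(c_i, s).
    def __init__(self, action_noise=0.1):
        self.centers = []          # kernel dictionary (states where atoms are centered)
        self.weights = []          # expansion coefficients
        self.sigma = action_noise  # exploration noise of the Gaussian policy

    def mean(self, s):
        return sum(w * rbf(c, s) for c, w in zip(self.centers, self.weights))

    def act(self, s):
        return self.mean(s) + self.sigma * np.random.randn()

    def gradient_step(self, s, a, q_estimate, step_size=0.05):
        # Stochastic policy-gradient ascent: the score of a Gaussian policy with
        # RKHS mean is (a - h(s)) / sigma^2, so each update appends one new kernel
        # centered at the sampled state -- this is what grows the model order.
        score = (a - self.mean(s)) / self.sigma ** 2
        self.centers.append(np.asarray(s, dtype=float))
        self.weights.append(step_size * q_estimate * score)

    def prune(self, prune_tol=1e-3):
        # Simplified destructive pruning in the spirit of kernel orthogonal
        # matching pursuit (not the paper's exact procedure): drop an atom whenever
        # the remaining atoms reproduce the current function to within prune_tol,
        # re-fitting the kept coefficients to the projection.
        i = 0
        while i < len(self.centers) and len(self.centers) > 1:
            trial_centers = self.centers[:i] + self.centers[i + 1:]
            err, beta = self._project(trial_centers)
            if err < prune_tol:
                self.centers = trial_centers
                self.weights = list(beta)
            else:
                i += 1

    def _project(self, trial_centers):
        # Project the current function onto span{k(c, .) : c in trial_centers};
        # return the RKHS-norm approximation error and the projected coefficients.
        w = np.array(self.weights)
        K_kk = np.array([[rbf(a, b) for b in self.centers] for a in self.centers])
        K_tk = np.array([[rbf(a, b) for b in self.centers] for a in trial_centers])
        K_tt = np.array([[rbf(a, b) for b in trial_centers] for a in trial_centers])
        beta = np.linalg.solve(K_tt + 1e-10 * np.eye(len(trial_centers)), K_tk @ w)
        err2 = float(w @ K_kk @ w - beta @ K_tt @ beta)
        return np.sqrt(max(err2, 0.0)), beta

The point mirrored here is that every stochastic gradient step enlarges the dictionary by one atom, so without the pruning pass the model order grows linearly with the iteration count; the pruning step is what keeps it bounded at the cost of converging only to a neighborhood of the critical point.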
