Abstract

Applying value function based reinforcement learning algorithms to real robots has been infeasible because approximating high-dimensional value functions is difficult. The difficulty of such high-dimensional value function approximation in previous methods is twofold: 1) instability of the value function approximation caused by non-smooth policy updates, and 2) the computational complexity associated with a high-dimensional state-action space. To cope with these issues, in this paper we propose Kernel Dynamic Policy Programming (KDPP), which smoothly updates the value function in an implicit high-dimensional feature space. The smooth policy update is promoted by adding the Kullback-Leibler divergence between the current and updated policies to the reward function as a regularization term, which stabilizes the value function approximation. The computational complexity is reduced by applying the kernel trick in the value function approximation. KDPP can therefore be interpreted as a novel yet practical extension of Dynamic Policy Programming (DPP) and kernelized value function-based reinforcement learning methods that combines the strengths of both. We successfully applied KDPP to learn a bottle-cap unscrewing task on a pneumatic artificial muscle (PAM)-driven humanoid robot hand, a system with a 24-dimensional state space, using a limited number of samples and ordinary computational resources.
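
To make the two ingredients concrete, the following is a minimal, hypothetical Python sketch, not the authors' implementation. The names `boltzmann_softmax`, `rbf_kernel`, `kernel_dpp_iteration`, `eta`, `gamma`, and `reg` are assumptions for illustration only. The softmax-weighted backup stands in for the KL-regularized smooth policy update of DPP, and kernel ridge regression over state-action features stands in for the kernel trick, so the high-dimensional feature space is never constructed explicitly.

```python
import numpy as np

def boltzmann_softmax(pref, eta):
    """Softmax-weighted average of action preferences.

    Replaces the greedy max with an average under the Boltzmann policy
    pi(a|x) proportional to exp(eta * pref(x, a)); this is one way the
    KL-regularized ("smooth") policy update can be expressed.
    """
    w = np.exp(eta * (pref - pref.max(axis=-1, keepdims=True)))
    w /= w.sum(axis=-1, keepdims=True)
    return (w * pref).sum(axis=-1)

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel between two sets of state-action features."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_dpp_iteration(X, A, R, Xn, pref_fn, actions, eta, gamma, reg=1e-3):
    """One hypothetical KDPP-style iteration on a batch of transitions.

    X, A, R, Xn: states, integer action indices, rewards, next states.
    pref_fn(x, a): current preference estimate; actions: discrete action set.
    Returns dual weights `alpha` and training features `Phi`, so the new
    preference is P(x, a) = rbf_kernel(phi(x, a), Phi) @ alpha, where phi
    concatenates the state with a one-hot action encoding (kernel trick).
    """
    # DPP-style regression targets using the smooth (softmax) backup.
    P_cur = np.array([[pref_fn(x, b) for b in actions] for x in X])
    P_next = np.array([[pref_fn(xn, b) for b in actions] for xn in Xn])
    taken = P_cur[np.arange(len(A)), A]
    targets = (taken + R + gamma * boltzmann_softmax(P_next, eta)
               - boltzmann_softmax(P_cur, eta))

    # Kernel ridge regression on state-action features (dual form).
    Phi = np.hstack([X, np.eye(len(actions))[A]])
    K = rbf_kernel(Phi, Phi)
    alpha = np.linalg.solve(K + reg * np.eye(len(K)), targets)
    return alpha, Phi
```

Under these assumptions, repeating this iteration while evaluating new preferences only through kernel values against the stored samples keeps the computation in the dual (sample-sized) space rather than the explicit feature space.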
