Abstract

This paper proposes a novel model-free inverse reinforcement learning method based on density ratio estimation under the framework of Dynamic Policy Programming. We show that the logarithm of the ratio between the optimal policy and the baseline policy is represented by the state-dependent cost and the value function. We propose to use density ratio estimation methods to estimate the density ratio of the policies, and regularized least squares to estimate the state-dependent cost and the value function that satisfy this relation. Our method avoids computing integrals such as the partition function. Numerical simulations of grid world navigation, car driving, and pendulum swing-up demonstrate its superiority over conventional methods.

I. INTRODUCTION

Reinforcement Learning (RL) is a computational framework for investigating the decision-making processes of both biological and artificial systems that can learn an optimal policy by interacting with an environment (1). Several open questions remain in RL, and one critical problem is how to design and prepare an appropriate reward/cost function. It is easy to design a sparse reward function, which gives a positive reward when the task is accomplished and zero otherwise, but such sparsity makes it hard to find an optimal policy. In some situations, it is easier to prepare examples of a desired behavior than to handcraft an appropriate reward/cost function. Recently, several methods of Inverse Reinforcement Learning (IRL) (2) and apprenticeship learning (3) have been proposed in order to derive a reward/cost function from a demonstrator's behaviors. IRL provides a powerful way to implement imitation learning, and it is also a promising approach for understanding the learning processes of biological systems because the reward/cost function specifies the goal of the behavior. However, most of the existing studies (3), (4) require a routine to solve forward reinforcement learning problems with estimated reward/cost functions. This process is usually very time consuming even when the model of the environment is available.

Recently, IRL algorithms that do not require solving forward reinforcement learning have been proposed. The key idea of these methods is to introduce the Kullback-Leibler (KL) divergence. For example, OptV (5) is a model-based IRL method derived from the concept of the Linearly solvable Markov Decision Process (LMDP) (6). LMDP is a sub-class of the Markov Decision Process.
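As an illustration of the pipeline described in the abstract, the following is a minimal sketch rather than the paper's implementation. It assumes an LMDP-style relation of the form ln(pi(y|x)/b(y|x)) = V(x) - V(y) - q(x) between the optimal policy pi, the baseline policy b, the state-dependent cost q, and the value function V (sign and discounting conventions may differ from those used in the paper). The log density ratio is estimated by probabilistic classification, a standard density ratio estimation technique, and linear-in-features q and V are then recovered by regularized least squares. The feature map phi and the sample arrays are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of density-ratio-based IRL:
# (1) estimate the log density ratio of two policies with a classifier,
# (2) fit q(x) and V(x) by regularized least squares on the assumed relation
#     log pi(y|x)/b(y|x) ~= V(x) - V(y) - q(x).
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def log_density_ratio(xy_pi, xy_b):
    """Estimate log pi/b at the samples xy_pi via logistic regression
    (density ratio estimation by probabilistic classification)."""
    X = np.vstack([xy_pi, xy_b])
    z = np.concatenate([np.ones(len(xy_pi)), np.zeros(len(xy_b))])
    clf = LogisticRegression().fit(X, z)
    # The decision function is the log-odds log p(z=1|x)/p(z=0|x), which equals
    # the log density ratio up to the log ratio of the class priors.
    logit = clf.decision_function(xy_pi)
    return logit - np.log(len(xy_pi) / len(xy_b))

def fit_cost_and_value(x, y, log_ratio, phi, reg=1e-2):
    """Fit q(x) = w_q . phi(x) and V(x) = w_v . phi(x) so that
    V(x) - V(y) - q(x) matches the estimated log ratio at samples (x, y)."""
    Phi_x, Phi_y = phi(x), phi(y)
    # Design matrix [Phi_x - Phi_y | -Phi_x] acting on the stacked weights (w_v, w_q).
    A = np.hstack([Phi_x - Phi_y, -Phi_x])
    w = Ridge(alpha=reg, fit_intercept=False).fit(A, log_ratio).coef_
    d = Phi_x.shape[1]
    return w[d:], w[:d]   # (w_q, w_v)
```

In this sketch, regularization (the ridge penalty) plays the role of the regularized least squares step mentioned in the abstract, and no partition function or other integral has to be evaluated.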
