Abstract

This paper proposes a novel model-free inverse reinforcement learning method based on density ratio estimation under the framework of Dynamic Policy Programming. We show that the logarithm of the ratio between the optimal policy and the baseline policy is represented by the state-dependent cost and the value function. We propose to use density ratio estimation methods to estimate the density ratio of the policies, and regularized least squares to estimate the state-dependent cost and the value function that satisfy this relation. Our method avoids computing integrals such as the partition function. Numerical simulations of grid world navigation, car driving, and pendulum swing-up demonstrate its superiority over conventional methods.

I. INTRODUCTION

Reinforcement Learning (RL) is a computational framework for investigating the decision-making processes of both biological and artificial systems that can learn an optimal policy by interacting with an environment (1). Several open questions remain in RL, and one critical problem is how to design and prepare an appropriate reward/cost function. It is easy to design a sparse reward function, which gives a positive reward when the task is accomplished and zero otherwise, but such sparsity makes it hard to find an optimal policy. In some situations, it is easier to prepare examples of a desired behavior than to handcraft an appropriate reward/cost function. Recently, several methods of Inverse Reinforcement Learning (IRL) (2) and apprenticeship learning (3) have been proposed in order to derive a reward/cost function from a demonstrator's behaviors. IRL provides a powerful way to implement imitation learning, and it is also a promising approach for understanding the learning processes of biological systems because the reward/cost function specifies the goal of the behavior. However, most of the existing studies (3), (4) require a routine to solve forward reinforcement learning problems with estimated reward/cost functions. This process is usually very time consuming even when the model of the environment is available.

Recently, IRL algorithms that do not require solving forward reinforcement learning have been proposed. The key idea of these methods is to introduce the Kullback-Leibler (KL) divergence. For example, OptV (5) is a model-based IRL method derived from the concept of the Linearly solvable Markov Decision Process (LMDP) (6). LMDP is a sub-class of the Markov Decision Process.
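As an illustration of the pipeline described in the abstract, the following is a minimal sketch rather than the paper's implementation. It assumes an LMDP-style relation of the form ln(pi(y|x)/b(y|x)) = V(x) - V(y) - q(x) between the optimal policy pi, the baseline policy b, the state-dependent cost q, and the value function V (sign and discounting conventions may differ from those used in the paper). The log density ratio is estimated by probabilistic classification, a standard density ratio estimation technique, and linear-in-features q and V are then recovered by regularized least squares. The feature map phi and the sample arrays are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of density-ratio-based IRL:
# (1) estimate the log density ratio of two policies with a classifier,
# (2) fit q(x) and V(x) by regularized least squares on the assumed relation
#     log pi(y|x)/b(y|x) ~= V(x) - V(y) - q(x).
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def log_density_ratio(xy_pi, xy_b):
    """Estimate log pi/b at the samples xy_pi via logistic regression
    (density ratio estimation by probabilistic classification)."""
    X = np.vstack([xy_pi, xy_b])
    z = np.concatenate([np.ones(len(xy_pi)), np.zeros(len(xy_b))])
    clf = LogisticRegression().fit(X, z)
    # The decision function is the log-odds log p(z=1|x)/p(z=0|x), which equals
    # the log density ratio up to the log ratio of the class priors.
    logit = clf.decision_function(xy_pi)
    return logit - np.log(len(xy_pi) / len(xy_b))

def fit_cost_and_value(x, y, log_ratio, phi, reg=1e-2):
    """Fit q(x) = w_q . phi(x) and V(x) = w_v . phi(x) so that
    V(x) - V(y) - q(x) matches the estimated log ratio at samples (x, y)."""
    Phi_x, Phi_y = phi(x), phi(y)
    # Design matrix [Phi_x - Phi_y | -Phi_x] acting on the stacked weights (w_v, w_q).
    A = np.hstack([Phi_x - Phi_y, -Phi_x])
    w = Ridge(alpha=reg, fit_intercept=False).fit(A, log_ratio).coef_
    d = Phi_x.shape[1]
    return w[d:], w[:d]   # (w_q, w_v)
```

In this sketch, regularization (the ridge penalty) plays the role of the regularized least squares step mentioned in the abstract, and no partition function or other integral has to be evaluated.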
