Abstract

Learning from demonstrations is a paradigm in which an apprentice agent learns a control policy for a dynamic environment by observing demonstrations delivered by an expert agent. In the literature it is usually implemented as either imitation learning (IL) or inverse reinforcement learning (IRL). On the one hand, IRL relies on Markov decision processes, where the goal of the apprentice agent is to find, from the expert demonstrations, a reward function that could explain the expert behavior. On the other hand, IL consists in directly generalizing the expert strategy, observed in the demonstrations, to unvisited states (it is therefore close to classification when there is a finite set of possible decisions). While these two visions are often considered as opposed to each other, the purpose of this paper is to exhibit a formal link between these approaches from which new algorithms can be derived. We show that IL and IRL can be redefined in a way that makes them equivalent, in the sense that there exists an explicit bijective operator (namely, the inverse optimal Bellman operator) between their respective spaces of solutions. To do so, we introduce the set-policy framework, which creates a clear link between IL and IRL. As a result, IL and IRL solutions making the best of both worlds are obtained. In addition, it is a unifying framework from which existing IL and IRL algorithms can be derived and which opens the way for IL methods able to deal with the environment's dynamics. Finally, the IRL algorithms derived from the set-policy framework are compared with algorithms belonging to the more common trajectory-matching family. Experiments demonstrate that the set-policy-based algorithms outperform both the standard IRL and IL ones and result in more robust solutions.
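To make the named operator concrete: in standard MDP notation, the inverse (optimal) Bellman operator maps an action-value function $Q$ to the reward for which $Q$ would be optimal, by subtracting the discounted expected optimal value of the next state. This is a sketch using the textbook form of the operator; the paper's exact definition over the set-policy framework may differ in details.

```latex
% Sketch (standard notation; symbols P, gamma, Q assumed from the MDP setting):
% the inverse optimal Bellman operator J maps Q to a reward R = JQ.
\[
  (\mathcal{J}Q)(s,a)
  \;=\;
  Q(s,a) \;-\; \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')
\]
```

Bijectivity in the claimed sense means that recovering a reward from an expert's $Q$ (the IRL view) and recovering a $Q$-function whose greedy policy imitates the expert (the IL view) are two parameterizations of the same solution set.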

Highlights

  • Because of the growing interest in robotics and other complex systems, new machine learning methods for model-free control have been the subject of much research over the last decade

  • From the trajectory-matching family, the selected algorithms are the Projection algorithm [22] (Proj), the Multiplicative Weights Apprenticeship Learning algorithm [23] (MWAL) and the Max Margin Planning algorithm [38] adapted to Inverse Reinforcement Learning (IRL) [25] (MMP)

  • We have presented an original paradigm, called the set-policy framework, that establishes a formal link between score-based Imitation Learning (IL) methods and IRL methods


Summary

Introduction

Because of the growing interest in robotics and other complex systems (such as interactive systems), new machine learning methods for model-free control have been the subject of much research over the last decade. The Learning from Demonstrations (LfD) problem is placed in the framework of Markov Decision Processes (MDP) [6], [7]. In machine learning, this is a standard framework for describing a dynamical system and learning an optimal controller for it. The system is described as a set of states, actions and transitions, and the learned behavior takes the form of a policy, a mapping from states to actions. Learning is done via Reinforcement Learning [8] where, after each action, the learning agent is provided with an immediate numerical feedback (called a reward) depending on the quality of the action. In the MDP framework, solving the LfD problem consists in learning a policy performing as well as the expert agent, but using a finite set of demonstrations instead of the reward signal. If π is deterministic, with an abuse of notation, π will represent either π ∈ Δ_A^S or its deterministic counterpart π_D ∈ A^S, depending on the context.
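The two views of a deterministic policy mentioned above (an element of A^S versus a one-hot element of Δ_A^S) can be sketched in a few lines. The numbers below are purely illustrative; the greedy derivation from a Q-table is an assumption for the example, not the paper's construction.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; the Q-values are illustrative only.
n_states, n_actions = 3, 2
Q = np.array([[1.0, 0.5],
              [0.2, 0.9],
              [0.4, 0.4]])  # action-value estimates, one row per state

# Deterministic policy pi_D in A^S: one action per state (greedy w.r.t. Q;
# np.argmax breaks ties in favor of the first maximal action).
pi_D = Q.argmax(axis=1)

# The same policy viewed as an element of Delta_A^S: for each state,
# a (degenerate, one-hot) probability distribution over actions.
pi = np.zeros((n_states, n_actions))
pi[np.arange(n_states), pi_D] = 1.0

print(pi_D)            # [0 1 0]
print(pi.sum(axis=1))  # every row is a valid distribution: [1. 1. 1.]
```

The "abuse of notation" in the text is exactly this identification: `pi_D` and `pi` encode the same behavior, and which representation is meant is determined by context.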
