Abstract

In inverse reinforcement learning (RL), there are two agents. An expert target agent has a performance cost function and exhibits control and state behaviors to a learner. The learner agent does not know the expert's performance cost function but seeks to reconstruct it by observing the expert's behaviors, and tries to imitate these behaviors optimally with its own response. In this article, we formulate an imitation problem where the optimal performance intent of a discrete-time (DT) expert target agent is unknown to a DT learner agent. Using only the expert's observed behavior trajectory, the learner seeks to determine a cost function that yields the same optimal feedback gain as the expert's and thus imitates the expert's optimal response. We develop an inverse RL approach with a new scheme to solve the behavior imitation problem. The approach consists of a cost function update based on an extension of RL policy iteration and inverse optimal control, and a control policy update based on optimal control. Then, under this scheme, we develop an inverse reinforcement Q-learning algorithm, which is an extension of RL Q-learning. This algorithm does not require any knowledge of the agent dynamics. Proofs of stability, convergence, and optimality are given. A key property of the nonunique solution is also shown. Finally, simulation experiments are presented to show the effectiveness of the new approach.
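
The nonuniqueness mentioned above can be illustrated in a simple setting. The following is a minimal sketch, not the article's algorithm: it assumes linear DT dynamics and a quadratic cost (the abstract does not state the problem class), and the matrices `A`, `B`, `Q`, `R` and the helper `lqr_gain` are hypothetical names chosen for illustration. It also uses the model matrices directly, which the article's Q-learning algorithm is said to avoid. Under these assumptions, scaling the cost weights by a positive constant leaves the optimal feedback gain unchanged, so a learner that matches the expert's gain need not recover the expert's exact cost.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=2000, tol=1e-10):
    """Optimal DT LQR gain K (control law u = -K x) via Riccati value iteration."""
    P = Q.copy()
    for _ in range(iters):
        # Gain induced by the current value matrix P.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati recursion: P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA.
        P_next = Q + A.T @ P @ (A - B @ K)
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Hypothetical agent dynamics (a sampled double integrator), for illustration only.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])

# Assumed expert cost weights, and a scaled copy a learner might arrive at.
Q_expert, R_expert = np.eye(2), np.array([[1.0]])
Q_learner, R_learner = 3.0 * Q_expert, 3.0 * R_expert

K_expert = lqr_gain(A, B, Q_expert, R_expert)
K_learner = lqr_gain(A, B, Q_learner, R_learner)

# Different cost functions, same optimal feedback gain.
gains_match = np.allclose(K_expert, K_learner)
print("expert gain: ", K_expert)
print("learner gain:", K_learner)
print("gains match: ", gains_match)
```

Positive scaling is only the simplest family of equivalent costs; the sketch does not attempt to characterize the full set of cost functions that share a gain.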
