Abstract

Robots have had a great impact on the manufacturing industry ever since the early seventies, when companies such as KUKA and ABB started deploying their first industrial robots. These robots performed only very specific tasks in specific ways within well-defined environments. Still, they proved to be very useful, as they could exceed human performance at these tasks. However, in order for robots to enter our daily lives, they need to become more versatile and need to operate in much less structured environments. This thesis is partly devoted to stretching these limitations by means of learning, namely imitation learning (IL) and inverse reinforcement learning (IRL).

Reinforcement learning (RL) is a powerful approach for enabling robots to solve a task in an unknown environment. The practitioner describes a desired behavior by specifying a reward function, and the robot autonomously interacts with the environment in order to find a control policy that generates high accumulated reward. However, RL is not well suited for letting non-experts teach new tasks, because specifying appropriate reward functions can be difficult. Demonstrating the desired behavior is often easier for non-experts. Imitation learning can be used to enable the robot to reproduce such demonstrations. However, without explicitly inferring and modeling the intentions behind the demonstrations, it can become difficult to solve the task in unseen situations. Inverse reinforcement learning therefore aims to infer a reward function from the demonstrations, such that optimizing this reward function yields the desired behavior even in different situations.

This thesis introduces a unifying approach that solves the inverse reinforcement learning problem in the same way as the reinforcement learning problem. This is achieved by framing both problems as information projection (I-projection) problems, i.e., we strive to minimize the relative entropy between a probabilistic model of the robot behavior and a given desired distribution (a schematic statement of this optimization problem is sketched below). Furthermore, a trust region on the robot behavior is used to stabilize the optimization. For inverse reinforcement learning, the desired distribution is implicitly given by the expert demonstrations, and the resulting optimization can be efficiently solved using state-of-the-art reinforcement learning methods. For reinforcement learning, the log-likelihood of the desired distribution is given by the reward function, and the resulting optimization problem corresponds to a standard reinforcement learning formulation, except for an additional objective of maximizing the entropy of the robot behavior. This entropy objective adds little overhead to the optimization, but can lead to better exploration and more diversified policies.

Trust-region I-projections are not only useful for training robots, but can also be applied to other machine learning problems. I-projections are typically used for variational inference, in order to approximate an intractable distribution with a simpler model. However, the resulting optimization problems are usually solved with stochastic gradient descent, which often suffers from high variance in the gradient estimates. As trust-region I-projections were shown to be effective for reinforcement learning and inverse reinforcement learning, this thesis also explores their use for variational inference. More specifically, trust-region I-projections are investigated for the problem of approximating an intractable distribution with a Gaussian mixture model (GMM) with an adaptive number of components.
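The trust-region I-projection that recurs throughout this thesis can be stated schematically as follows; the notation (a model $q$ of the robot behavior, a desired distribution $p^{*}$, trajectories $\tau$, a reward function $R$, a temperature $\eta$, and a trust-region bound $\epsilon$) is illustrative rather than taken verbatim from the thesis:

\[
\min_{q} \; \mathrm{KL}\!\left(q \,\|\, p^{*}\right)
\qquad \text{subject to} \qquad
\mathrm{KL}\!\left(q \,\|\, q_{\mathrm{old}}\right) \le \epsilon .
\]

For inverse reinforcement learning, $p^{*}$ is only implicitly given through the expert demonstrations. For reinforcement learning, choosing $\log p^{*}(\tau) \propto \tfrac{1}{\eta} R(\tau)$ turns the I-projection into

\[
\max_{q} \; \mathbb{E}_{q}\!\left[R(\tau)\right] + \eta\, \mathcal{H}(q)
\qquad \text{subject to} \qquad
\mathrm{KL}\!\left(q \,\|\, q_{\mathrm{old}}\right) \le \epsilon ,
\]

i.e., a standard trust-region reinforcement learning problem with the additional entropy objective $\mathcal{H}(q)$ mentioned above.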
GMMs are highly desirable for variational inference because they can yield arbitrarily accurate approximations while inference from GMMs remains relatively cheap. In order to make learning the GMM feasible, we derive a lower bound that enables us to decompose the objective function (a schematic form of such a bound is sketched below). The optimization can then be performed by iteratively updating individual components using a technique from reinforcement learning. The resulting method is capable of learning approximations of significantly higher quality than existing variational inference methods.

Due to the similarity of the underlying optimization problems, the insights gained from our variational inference method are also useful for IL and IRL. Namely, a similar lower bound can also be applied to the I-projection formulation of imitation learning. However, whereas for variational inference the lower bound serves to decompose the objective function, for imitation learning it allows us to provide a reward signal to the robot that does not depend on its behavior. Compared to reward functions that are relative to the current behavior of the robot, which are typical of popular adversarial methods, behavior-independent reward functions have the advantage that we can show convergence even under greedy optimization. Furthermore, behavior-independent reward functions solve the inverse reinforcement learning problem, thereby closing the gap between imitation learning and IRL. Nevertheless, algorithms derived from our non-adversarial formulation are very similar to existing adversarial imitation learning (AIL) methods, and we can even show that adversarial inverse reinforcement learning (AIRL) is an instance of our formulation. AIRL was derived from an adversarial formulation, and we point out several problems with that derivation. In contrast, we show that AIRL can be derived straightforwardly from our non-adversarial formulation. Furthermore, we demonstrate that the non-adversarial formulation can also be used to derive novel algorithms by presenting a non-adversarial method for offline imitation learning.
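The lower bound used for learning the GMM (and, in similar form, for the non-adversarial imitation learning formulation) can be sketched as follows. This is a minimal sketch for approximating an intractable target $\tilde{p}(x)$ with a mixture $q(x) = \sum_{o} q(o)\, q(x \mid o)$ using an auxiliary distribution $\tilde{q}(o \mid x)$; the exact parameterization used in the thesis may differ:

\[
-\mathrm{KL}\!\left(q \,\|\, \tilde{p}\right)
\;\ge\;
\sum_{o} q(o) \left( \mathbb{E}_{q(x \mid o)}\!\left[ \log \tilde{p}(x) + \log \tilde{q}(o \mid x) - \log q(x \mid o) \right] - \log q(o) \right),
\]

with equality when the auxiliary distribution matches the true responsibilities, $\tilde{q}(o \mid x) = q(o \mid x)$. Because the right-hand side separates into one term per component $q(x \mid o)$, each component (and the set of weights $q(o)$) can be updated individually with a trust-region step of the kind used for the reinforcement learning problems above.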
