Off-Policy Reinforcement Learning for Robotics

Samuele Tosatto

doi:10.26083/tuprints-00017536

Abstract

Industrial processes are today largely automated by means of robotic manipulators. In some cases, robots occupy a large fraction of the production line, performing a rich range of tasks. In contrast to their tireless ability to repeat the same tasks with millimetric precision, current robots adapt poorly to new scenarios. In many cases, this lack of adaptability hinders closer human-robot interaction; furthermore, when the production line needs to change, the robots must be reconfigured by highly qualified personnel. Machine learning and, more particularly, reinforcement learning hold the promise of automated systems that can adapt to new situations and learn new tasks. Despite the overwhelming progress in the field in recent years, the vast majority of reinforcement learning algorithms are not directly applicable to real robots. State-of-the-art reinforcement learning algorithms require intensive interaction with the environment and are unsafe in the early stages of learning, when the policy performs poorly and may harm the system. For these reasons, reinforcement learning has been successful mainly on simulated tasks such as computer and board games, where a vast number of samples can be collected in parallel and no real system can be damaged. To mitigate these issues, researchers have proposed first employing imitation learning to obtain a reasonable policy and subsequently refining it via reinforcement learning. In this thesis, we focus on two main issues that prevent this pipeline from working efficiently: (i) robotic movements are represented with a high number of parameters, which prevents both safe and efficient exploration; (ii) the policy improvement is usually on-policy, which also causes inefficient and unsafe updates.
In Chapter 3, we propose an efficient method to perform dimensionality reduction of learned robotic movements, exploiting redundancies in the movement spaces (which occur more commonly in manipulation tasks) rather than redundancies in the robot kinematics. The dimensionality reduction allows projection onto latent spaces that, with high probability, represent movements close to the demonstrated ones. To make reinforcement learning safer and more efficient, we define the off-policy update in the movement’s latent space in Chapter 4. In Chapter 5, we propose a novel off-policy gradient estimator, which makes use of a non-parametric technique named Nadaraya-Watson kernel regression. Building on a solid theoretical framework, we derive statistical guarantees. We believe that providing strong guarantees is at the core of safe machine learning. In this spirit, we further expand and analyze the statistical guarantees on Nadaraya-Watson kernel regression in Chapter 6. Usually, to avoid challenging exploration in reinforcement learning applied to robotics, one must define a highly engineered reward function. This limitation makes it hard for non-expert users to define new tasks. Exploration remains an open issue in high-dimensional, sparse-reward settings. To mitigate this issue, we propose a far-sighted exploration bonus built on information-theoretic principles in Chapter 7. To test our algorithms, we provide a thorough analysis in simulated environments and, in some cases, on real-world robotic tasks. The analysis supports our claims, showing that the proposed techniques can learn safely from a limited set of demonstrations and robotic interactions.
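
For readers unfamiliar with Nadaraya-Watson kernel regression, the following minimal sketch shows the generic one-dimensional estimator: a prediction is a kernel-weighted average of the observed targets. It is not the off-policy gradient estimator of Chapter 5. The function names, the Gaussian kernel, the bandwidth value, and the toy data are all assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the normalizer cancels in the ratio below."""
    return math.exp(-0.5 * u * u)


def nadaraya_watson(x_query, xs, ys, bandwidth=0.5):
    """Estimate y at x_query as a kernel-weighted average of the observed ys.

    xs, ys    : observed inputs and targets (equal-length sequences of floats)
    bandwidth : kernel width (illustrative default, not taken from the thesis)
    """
    weights = [gaussian_kernel((x_query - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed: the query is too far from all samples.
        raise ValueError("all kernel weights are zero; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy data (hypothetical): noiseless samples of y = x^2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
estimate = nadaraya_watson(1.5, xs, ys)
```

Because the weights are non-negative and normalized, the estimate always lies between the smallest and largest observed target. The bandwidth controls the trade-off between smoothness and fidelity to nearby samples.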

Full Text