Reinforcement and Imitation Learning Applied to Autonomous Aerial Robot Control

Gabriel Moraes Barros,Esther L Colombini

doi:10.5753/wtdr_ctdr.2020.14956

Abstract

In robotics, the ultimate goal of reinforcement learning is to endow robots with the ability to learn, improve, adapt, and reproduce tasks with dynamically changing constraints based on exploration and autonomous learning. Reinforcement Learning (RL) aims at addressing this problem by enabling a robot to learn behaviors through trial-and-error. With RL, a Neural Network can be trained as a function approximator to directly map states to actuator commands making any predefined control structure not-needed for training. However, the knowledge required to converge these methods is usually built from scratch. Learning may take a long time, not to mention that RL algorithms need a stated reward function. Sometimes, it is not trivial to define one. Often it is easier for a teacher, human or intelligent agent, do demonstrate the desired behavior or how to accomplish a given task. Humans and other animals have a natural ability to learn skills from observation, often from merely seeing these skills’ effects: without direct knowledge of the underlying actions. The same principle exists in Imitation Learning, a practical approach for autonomous systems to acquire control policies when an explicit reward function is unavailable, using supervision provided as demonstrations from an expert, typically a human operator. In this scenario, this work’s primary objective is to design an agent that can successfully imitate a prior acquired control policy using Imitation Learning. The chosen algorithm is GAIL since we consider that it is the proper algorithm to tackle this problem by utilizing expert (state, action) trajectories. As reference expert trajectories, we implement state-of-the-art on and off-policy methods PPO and SAC. Results show that the learned policies for all three methods can solve the task of low-level control of a quadrotor and that all can account for generalization on the original tasks.

Full Text