Continuous Deep Maximum Entropy Inverse Reinforcement Learning using online POMDP

Junior A. R. Silva,Valdir Grassi,Denis Fernando Wolf

doi:10.1109/icar46387.2019.8981548

Abstract

A vehicle navigating in an urban environment must obey traffic rules by properly setting its speed, such as staying below the road speed limit and avoiding collision with other vehicles. This is presumably the scenario that autonomous vehicles will face: they will share the traffic roads with other vehicles (autonomous or not), cooperatively interacting with them. In other words, autonomous vehicles should not only follow traffic rules, but should also behave in such a way that resembles other vehicles behavior. However, manually specification of such behavior is a time-consuming and error-prone task, since driving in urban roads is a complex task, which involves many factors. This paper presents a multitask decision making framework that learns an expert driver's behavior driving in an urban scenario containing traffic lights and other vehicles. For this purpose, Inverse Reinforcement Learning (IRL) is used to learn a reward function that explains the expert driver's behavior. Most IRL approaches require solving a Markov Decision Process (MDP) in each iteration of the algorithm to compute the optimal policy given the current rewards. Nevertheless, the computational cost of solving an MDP is high when considering large state spaces. To overcome this issue, the optimal policy is estimated by sampling trajectories in regions of the space with higher rewards. To do so, the problem is modeled as a continuous Partially Observed Markov Decision Process (POMDP), in which the intentions of other vehicles are only partially observed. An online solver is employed in order to sample trajectories given the current rewards. The efficiency of the proposed framework is demonstrated through simulations, showing that the controlled vehicle is be able to mimic an expert driver's behavior.

Full Text