Abstract

Given a training environment that follows a Markov decision process for a specific task, a deep reinforcement learning (DRL) agent can find optimal policies that map states of the environment to appropriate actions by repeatedly trying various actions to maximize the training reward. However, the learned policies cannot be reused directly when training for other, new tasks, resulting in wasted time and resources. To solve this problem, we propose a DRL-based method for training an agent that selects, from a set of previously trained optimal policies, the policy appropriate to the current state of the environment for a given task that can be decomposed into sub-tasks. We apply the proposed method to training a person-following robot, a task that can be broken down into three sub-tasks: navigation, left attending, and right attending. Using the proposed method, the optimal navigation policy obtained in our previous work is integrated with the attending policies trained in this study. We also introduce weight-scheduled action smoothing, which stabilizes the actions generated by the agent during attending-task training. Our experimental results show that the proposed method integrates all sub-policies using the action smoothing method even though the navigation and attending policies have dissimilar input structures, different output ranges, and are trained in different ways. Moreover, the proposed method shows better results than training from scratch and training with a transfer learning strategy.
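To illustrate the idea, the following is a minimal Python sketch of how a master agent might choose among pretrained sub-policies and how a weight-scheduled action smoothing step could blend the chosen action with the previous one. All names, the linear weight schedule, and the placeholder sub-policies are illustrative assumptions, not the implementation used in the paper.

import numpy as np

# Placeholder sub-policies; in the paper these would be the pretrained
# navigation, left-attending, and right-attending networks.
def navigation_policy(state):   return np.tanh(state[:2])
def left_attend_policy(state):  return np.tanh(state[2:4])
def right_attend_policy(state): return np.tanh(state[4:6])

SUB_POLICIES = [navigation_policy, left_attend_policy, right_attend_policy]

def scheduled_weight(step, total_steps, w_start=0.9, w_end=0.1):
    # Linearly anneal the smoothing weight over training (assumed schedule).
    frac = min(step / total_steps, 1.0)
    return w_start + frac * (w_end - w_start)

def select_and_smooth(master_policy, state, prev_action, step, total_steps):
    # The master agent picks which pretrained sub-policy to run on this state,
    # then the sub-policy's action is blended with the previous action.
    idx = master_policy(state)                # discrete choice: 0, 1, or 2
    raw_action = SUB_POLICIES[idx](state)     # action from the chosen sub-policy
    w = scheduled_weight(step, total_steps)
    return w * prev_action + (1.0 - w) * raw_action

# Example: a trivial master policy that always selects the navigation policy.
state = np.zeros(6)
action = select_and_smooth(lambda s: 0, state, prev_action=np.zeros(2),
                           step=10, total_steps=1000)

Early in training the large smoothing weight keeps consecutive actions close to each other, which is what stabilizes the attending-task training; as the weight is annealed, the agent's own actions dominate.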

Highlights

  • In line with the rapid growth of the field of robotics, there is increasing demand for service robots whose main objective is to assist and stay close to humans in order to support their daily needs

  • Since we have already obtained the optimal navigation policy in our previous work [15] and want to reuse that policy in this study, we extend the prior work of Frans et al. [14] so that it is capable of reusing previously trained optimal policies corresponding to all sub-tasks in a complex environment

  • We first describe the details of the person-following robot environment and present weight-scheduled action smoothing, the method we propose for the attending-task training procedure



Introduction

In line with the rapid growth of the field of robotics, there is increasing demand for service robots whose main objective is to assist and stay close to humans in order to support their daily needs. The development of robots whose main ability is to follow and attend a specified target person is therefore desired. Many challenges still have to be dealt with when developing this type of partner robot, since person-following is not a simple, ordinary task [1], [2]. One of the abilities a person-following robot must have is navigation. Suppose the robot is positioned away from the target person: it has to be able to perform motion planning to find safe paths so that it can approach him while avoiding surrounding obstacles. Once its position is near the target person, the robot subsequently has to find the most appropriate position relative to him to perform the attending task appropriately.
