Cross-Entropy Regularized Policy Gradient for Multirobot Nonadversarial Moving Target Search

Hongliang Guo,Zhaokai Liu,Daniela Rus,Rui Shi,Wei-Yun Yau

doi:10.1109/tro.2023.3263459

Abstract

This article investigates the multirobot efficient search (MuRES) for a nonadversarial moving target problem from the multiagent reinforcement learning (MARL) perspective. MARL is deemed as a promising research field for cooperative multiagent applications. However, one of the main bottlenecks of applying MARL to the MuRES problem is the nonstationarity introduced by multiple learning agents. With learning agents simultaneously updating their policies, the environment cannot be modeled as a <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">stationary</i> Markov decision process, which results in the inapplicability of fundamental reinforcement learning techniques such as deep <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$Q$</tex-math></inline-formula> -network and policy gradient (PG). In view of that, we adopt the centralized training and decentralized execution scheme and thereby propose a cross-entropy regularized policy gradient (CE-PG) method to train the learning agents/robots. We let the robots <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">commit</i> to a predetermined policy during execution, collect the trajectories, and then perform centralized training for the corresponding policy improvement. In this way, the nonstationarity problem is overcome, in that the robots do not update their policies during execution. During the centralized training stage, we improve the canonical PG method to consider the interactions among robots by adding a cross-entropy regularization term, which essentially functions to “disperse” the robots in the environment. Extensive simulation results and comparisons with state of the art show CE-PG's superior performance, and we also validate the algorithm with a real multirobot system in an indoor moving target search scenario.

Full Text