Abstract

Multi-agent combat is a competitive scenario in multi-agent reinforcement learning (MARL) in which agents use reinforcement learning methods to learn optimal policies. Because these policies change during learning, the environment becomes non-stationary, making it difficult to predict opponents' policies. Many reinforcement learning methods attempt to address non-stationarity, and most previous works do so by placing all agents in a single frame and modelling their policies. In a combat environment, however, opponents cannot be placed in the same frame as our agents. We therefore group the opponents and our agents into two separate frames and treat the opponents only as part of the environment. In this paper, we focus on the problem of modelling opponents' policies in non-stationary environments. To solve this problem, we propose an algorithm called Additional Opponent Characteristics Multi-agent Deep Deterministic Policy Gradient (AOC-MADDPG) with the following contributions: (1) we propose a new actor-critic framework that handles the non-stationarity of MARL environments, allowing agents to adapt to more complex environments; and (2) we build a model of the opponents' policies by feeding the opponents' observations and actions into the critic network as additional characteristics. We evaluate AOC-MADDPG in two multi-agent combat environments, where it significantly outperforms the baseline: agents trained by our method obtain higher rewards in non-stationary environments.

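The abstract describes augmenting the critic with the opponents' observations and actions as extra input features. The following is a minimal sketch of that idea in PyTorch; the class name, layer sizes, and concatenation scheme are assumptions for illustration and are not taken from the paper.

```python
# Sketch of an opponent-aware centralized critic: in addition to the team's
# observations and actions (as in a standard MADDPG critic), the opponents'
# observations and actions are concatenated in as additional characteristics.
import torch
import torch.nn as nn


class OpponentAwareCritic(nn.Module):
    def __init__(self, team_obs_dim, team_act_dim,
                 opp_obs_dim, opp_act_dim, hidden_dim=128):
        super().__init__()
        input_dim = team_obs_dim + team_act_dim + opp_obs_dim + opp_act_dim
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar Q-value estimate
        )

    def forward(self, team_obs, team_act, opp_obs, opp_act):
        # Opponent observations and actions are treated as features of the
        # (non-stationary) environment rather than as controllable agents.
        x = torch.cat([team_obs, team_act, opp_obs, opp_act], dim=-1)
        return self.net(x)
```

In this sketch the critic is only used during training (as in MADDPG's centralized-training, decentralized-execution setup), so the opponents' information never has to be available to the actors at execution time.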