Abstract

In Markov games, accurately detecting opponent policies and reusing optimal response policies remains a challenging problem. Most previous works assume that opponents switch their policies infrequently and only at the end of an episode. In practice, however, opponents may change their policies at high frequency, or even within an episode. Moreover, the agent may obtain inconsistent optimal returns against different opponent behaviors, which makes policy detection even harder. This paper studies how to handle non-stationary opponents with abrupt policy changes through accurate policy detection and direct policy reuse. Specifically, we propose a context-aware Bayesian policy reuse (CABPR) algorithm to accurately identify and track multi-strategic opponents. To continuously infer the opponent policy, an intra-episode belief is introduced that takes advantage of opponent models. Within an episode, this intra-episode belief is combined with an inter-episode belief maintained by Bayesian inference, so that the opponent type is detected jointly from its behaviors and episodic rewards. The agent then reuses the corresponding best response policy. We demonstrate the advantages of the proposed algorithm over several state-of-the-art algorithms in terms of episodic rewards, accumulated rewards, and detection accuracy in four competitive scenarios.
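
As a rough illustration of the detection idea described in the abstract, the sketch below shows how an inter-episode belief (Bayesian inference over episodic rewards) and an intra-episode belief (likelihoods of observed opponent actions under learned opponent models) could be combined to identify the opponent type and pick a response policy. This is not the authors' CABPR implementation: the opponent types, performance models, opponent models, and the product-based mixing rule are all illustrative assumptions.

```python
"""Minimal sketch of joint inter-/intra-episode belief updates (illustrative only)."""
import numpy as np

OPPONENT_TYPES = ["aggressive", "defensive", "random"]  # hypothetical type names

# Hypothetical performance models: mean/std of the episodic reward the agent
# obtains when playing its best response against each opponent type.
PERF_MODELS = {
    "aggressive": (5.0, 1.0),
    "defensive":  (2.0, 1.5),
    "random":     (0.0, 2.0),
}

# Hypothetical opponent models: probability each type assigns to each of
# three discrete opponent actions (each row sums to 1).
OPP_MODELS = {
    "aggressive": np.array([0.7, 0.2, 0.1]),
    "defensive":  np.array([0.1, 0.2, 0.7]),
    "random":     np.array([1 / 3, 1 / 3, 1 / 3]),
}


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian reward model at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def inter_episode_update(belief, episodic_reward):
    """Bayesian update of the belief over opponent types from one episodic reward."""
    posterior = np.array([
        belief[i] * gaussian_pdf(episodic_reward, *PERF_MODELS[t])
        for i, t in enumerate(OPPONENT_TYPES)
    ])
    return posterior / posterior.sum()


def intra_episode_belief(observed_actions):
    """Belief over opponent types from in-episode opponent actions via the opponent models."""
    log_liks = np.array([
        np.sum(np.log(OPP_MODELS[t][observed_actions])) for t in OPPONENT_TYPES
    ])
    probs = np.exp(log_liks - log_liks.max())
    return probs / probs.sum()


# Toy usage: suppose the true opponent is "defensive" (low reward, actions mostly index 2).
belief = np.ones(len(OPPONENT_TYPES)) / len(OPPONENT_TYPES)      # uniform prior
belief = inter_episode_update(belief, episodic_reward=1.8)        # inter-episode evidence
intra = intra_episode_belief(np.array([2, 2, 1, 2, 2]))           # intra-episode evidence

# Combine the two beliefs (simple product-and-renormalise mixing, an assumption).
joint = belief * intra
joint /= joint.sum()

detected = OPPONENT_TYPES[int(np.argmax(joint))]
print("joint belief:", dict(zip(OPPONENT_TYPES, joint.round(3))))
print("detected opponent type -> reuse best response policy for:", detected)
```

In this toy run the intra-episode evidence dominates between reward observations, which is the role the abstract ascribes to the intra-episode belief: it lets the agent react to a policy switch before the episode ends and a new episodic reward arrives.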
