Abstract

This paper focuses on filling the gap between strategy evaluation and strategy learning in two-player symmetric games, since a learning algorithm may converge to strategies that an evaluation metric does not prefer. When a player determines its strategy, it first evaluates candidate strategies without knowing the opponent's decision; based on the result of this evaluation, a preferred strategy is then selected. In contrast, many multi-agent reinforcement learning algorithms are constructed under the assumption that the other players' strategies are known in each training episode. In this paper, we first introduce two graph-based metrics grounded in sink equilibrium to characterize the players' preferred strategies in strategy evaluation. These metrics can be regarded as generalized solution concepts in games. We then propose two variants of the classical self-play algorithm, named strictly best-response and weakly better-response self-play, to learn strategies for the players. By modeling the learning processes as walks over joint-strategy response digraphs, we prove that under certain conditions the strategies learned by the two variants are the preferred strategies under the two metrics, respectively, which fills the evaluation–learning gap and ensures that the preferred strategies are learned. We also investigate the relationship between the two metrics.
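As a concrete illustration of the sink-equilibrium notion underlying the two metrics, the sketch below builds a strictly-best-response digraph over the joint strategy profiles of a small two-player symmetric game and extracts its sink strongly connected components. The payoff matrix, the edge convention (one player at a time deviating to a best response), and all function names are illustrative assumptions, not the paper's actual construction.

```python
from itertools import product

# Hypothetical symmetric 2x2 game (payoffs assumed for illustration only):
# A[i][j] is the row player's payoff for strategy i against opponent j;
# by symmetry, the column player playing j against i receives A[j][i].
A = [[3, 0],
     [4, 1]]

n = len(A)
profiles = list(product(range(n), range(n)))  # joint strategy profiles

def best_responses(opp):
    """Strategies that maximize payoff against a fixed opponent strategy."""
    payoffs = [A[s][opp] for s in range(n)]
    top = max(payoffs)
    return [s for s in range(n) if payoffs[s] == top]

# Strictly-best-response digraph: an edge leaves a profile whenever one
# player can switch to a best response different from its current strategy.
edges = {v: set() for v in profiles}
for (i, j) in profiles:
    for bi in best_responses(j):
        if bi != i:
            edges[(i, j)].add((bi, j))
    for bj in best_responses(i):
        if bj != j:
            edges[(i, j)].add((i, bj))

def sccs(nodes, edges):
    """Kosaraju's algorithm: strongly connected components of a digraph."""
    order, seen = [], set()
    def dfs(v):
        seen.add(v)
        for w in edges[v]:
            if w not in seen:
                dfs(w)
        order.append(v)
    for v in nodes:
        if v not in seen:
            dfs(v)
    rev = {v: set() for v in nodes}       # reverse all edges
    for v in nodes:
        for w in edges[v]:
            rev[w].add(v)
    comps, done = [], set()
    for v in reversed(order):             # second pass in finish order
        if v in done:
            continue
        comp, stack = set(), [v]
        done.add(v)
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in rev[u]:
                if w not in done:
                    done.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

# A sink equilibrium is a sink SCC: no edge leaves the component.
sinks = [c for c in sccs(profiles, edges)
         if all(w in c for v in c for w in edges[v])]
print(sinks)  # this dominance-solvable game has the single sink {(1, 1)}
```

Under this convention, the weakly better-response variant would instead add an edge to every strictly payoff-improving deviation, not only best responses, which generally yields a denser digraph and possibly different sink components.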
