Abstract

Policy Space Response Oracles (PSRO), which builds on the tabular Double Oracle (DO) method, is a powerful approach to large two-player zero-sum games and has achieved state-of-the-art performance. Though guaranteed to converge to a Nash equilibrium, existing PSRO and its variants suffer from two drawbacks: (1) exponential growth in the number of iterations and (2) severe performance oscillation before convergence. To address these issues, this paper proposes Efficient Double Oracle (EDO), a tabular double oracle algorithm for extensive-form two-player zero-sum games that is guaranteed to converge in a number of iterations linear in the number of infostates while decreasing exploitability at every iteration. To this end, EDO first mixes best responses at every infostate, making full use of the current policy population and significantly reducing the number of iterations. Moreover, at each iteration EDO computes, for each player, the restricted policy that minimizes its exploitability against an unrestricted opponent. Finally, we introduce Neural EDO (NEDO) to scale EDO up to large games, where the best response and the meta-Nash equilibrium (meta-NE) are learned through deep reinforcement learning. Experiments on Leduc Poker and Kuhn Poker show that EDO achieves lower exploitability than PSRO and XFP for the same amount of computation. We also find that NEDO empirically outperforms PSRO and NXDO on Leduc Poker and different versions of Tic Tac Toe.
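For readers unfamiliar with the tabular Double Oracle method that PSRO and EDO build on, the following Python sketch shows classic DO on a zero-sum matrix game: each iteration solves the restricted game for a meta-Nash equilibrium and adds each player's best response to the opponent's meta-Nash. This is only a minimal illustration of the population/best-response loop, not the authors' EDO (which operates on extensive-form infostates and mixes best responses per infostate); the function names maximin_strategy and double_oracle are illustrative, not from the paper.

import numpy as np
from scipy.optimize import linprog

def maximin_strategy(A):
    """Row player's Nash strategy and value for zero-sum payoff matrix A (row maximizes)."""
    m, n = A.shape
    # Variables: x (m probabilities) and v (game value); minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Enforce v <= (A^T x)_j for every column j, i.e. -A^T x + v <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is unbounded.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

def double_oracle(A, max_iters=100):
    """Tabular Double Oracle on zero-sum matrix game A; returns the strategy populations."""
    rows, cols = [0], [0]  # start each population with an arbitrary pure strategy
    for _ in range(max_iters):
        sub = A[np.ix_(rows, cols)]
        x, _ = maximin_strategy(sub)      # row player's meta-Nash over its population
        y, _ = maximin_strategy(-sub.T)   # column player's meta-Nash (it maximizes -A^T)
        # Best responses over the FULL strategy sets against the opponent's meta-Nash.
        br_row = int(np.argmax(A[:, cols] @ y))
        br_col = int(np.argmax(-(x @ A[rows, :])))
        if br_row in rows and br_col in cols:
            break  # no new strategies: the restricted meta-Nash is a Nash of the full game
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
    return rows, cols

EDO modifies this loop in the two ways the abstract describes: best responses are mixed at every infostate rather than added as whole policies, and the restricted policy is chosen to minimize exploitability against an unrestricted opponent, which yields the per-iteration exploitability decrease.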
