Abstract

Policy Space Response Oracles (PSRO), which builds on the tabular Double Oracle (DO) method, is a powerful approach to large two-player zero-sum games and has achieved state-of-the-art performance. Though guaranteed to converge to a Nash equilibrium, existing PSRO and its variants suffer from two drawbacks: (1) exponential growth in the number of iterations and (2) severe performance oscillation before convergence. To address these issues, this paper proposes Efficient Double Oracle (EDO), a tabular double oracle algorithm for extensive-form two-player zero-sum games that is guaranteed to converge in a number of iterations linear in the number of infostates while decreasing exploitability at every iteration. To this end, EDO first mixes best responses at every infostate, making full use of the current policy population and significantly reducing the number of iterations. Moreover, at each iteration EDO computes, for each player, the restricted policy that minimizes its exploitability against an unrestricted opponent. Finally, we introduce Neural EDO (NEDO) to scale EDO up to large games, where the best response and the meta-Nash equilibrium (meta-NE) are learned through deep reinforcement learning. Experiments on Leduc Poker and Kuhn Poker show that EDO achieves lower exploitability than PSRO and XFP for the same amount of computation. We also find that NEDO empirically outperforms PSRO and NXDO on Leduc Poker and different versions of Tic Tac Toe.
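For readers unfamiliar with the tabular Double Oracle method that PSRO and EDO build on, the following Python sketch shows classic DO on a zero-sum matrix game: each iteration solves the restricted game for a meta-Nash equilibrium and adds each player's best response to the opponent's meta-Nash. This is only a minimal illustration of the population/best-response loop, not the authors' EDO (which operates on extensive-form infostates and mixes best responses per infostate); the function names maximin_strategy and double_oracle are illustrative, not from the paper.

import numpy as np
from scipy.optimize import linprog

def maximin_strategy(A):
    """Row player's Nash strategy and value for zero-sum payoff matrix A (row maximizes)."""
    m, n = A.shape
    # Variables: x (m probabilities) and v (game value); minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Enforce v <= (A^T x)_j for every column j, i.e. -A^T x + v <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is unbounded.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

def double_oracle(A, max_iters=100):
    """Tabular Double Oracle on zero-sum matrix game A; returns the strategy populations."""
    rows, cols = [0], [0]  # start each population with an arbitrary pure strategy
    for _ in range(max_iters):
        sub = A[np.ix_(rows, cols)]
        x, _ = maximin_strategy(sub)      # row player's meta-Nash over its population
        y, _ = maximin_strategy(-sub.T)   # column player's meta-Nash (it maximizes -A^T)
        # Best responses over the FULL strategy sets against the opponent's meta-Nash.
        br_row = int(np.argmax(A[:, cols] @ y))
        br_col = int(np.argmax(-(x @ A[rows, :])))
        if br_row in rows and br_col in cols:
            break  # no new strategies: the restricted meta-Nash is a Nash of the full game
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
    return rows, cols

EDO modifies this loop in the two ways the abstract describes: best responses are mixed at every infostate rather than added as whole policies, and the restricted policy is chosen to minimize exploitability against an unrestricted opponent, which yields the per-iteration exploitability decrease.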
