Abstract

Human-computer gaming under incomplete information is usually modeled as a two-player zero-sum game, and counterfactual regret minimization (CFR) is a popular algorithm for such games. However, existing CFR algorithms and their variants use a fixed regret-calculation and strategy-update scheme throughout the iteration process; each scheme has its own strengths and weaknesses in incomplete-information extensive-form games, and their generalization performance is weak. To address this problem, this paper combines the proximal policy optimization (PPO) algorithm from reinforcement learning with CFR, training a rational agent to adaptively select the appropriate regret-calculation and strategy-update type at each CFR iteration, thereby improving the generalization performance of current CFR algorithms and optimizing policies in incomplete-information extensive-form games. The proposed algorithm is verified on general poker game experiments, and a stepwise reward function is formulated to train the agent's action policy. Experimental results show that, compared with existing state-of-the-art methods, the PPO-CFR algorithm achieves better generalization performance and lower exploitability, and its iterated policy is closer to the Nash equilibrium policy.
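To make the adaptive-selection idea concrete, the following is a minimal sketch, not the paper's implementation: a toy zero-sum matrix game (rock-paper-scissors) in which, at each regret-minimization iteration, a selector chooses between two update variants (plain regret matching vs. regret matching+) and receives a stepwise reward based on the reduction in exploitability. The epsilon-greedy selector stands in for the PPO agent, and all function and variable names here are assumptions for illustration only.

```python
# Illustrative sketch: variant selection inside a regret-minimization loop.
# The paper uses a PPO agent inside full extensive-form CFR; here a simple
# epsilon-greedy selector and a 3x3 matrix game stand in for both.
import numpy as np

PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]], dtype=float)  # row player's payoff (RPS)

def regret_matching(regrets):
    """Turn cumulative regrets into a strategy (uniform if all non-positive)."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

def exploitability(avg_row, avg_col):
    """Sum of best-response gains against the average strategies."""
    value = avg_row.dot(PAYOFF).dot(avg_col)
    br_row = PAYOFF.dot(avg_col).max() - value          # row player's gain
    br_col = (-avg_row.dot(PAYOFF)).max() + value       # column player's gain
    return br_row + br_col

rng = np.random.default_rng(0)
regrets = [np.zeros(3), np.zeros(3)]       # cumulative regrets per player
strategy_sum = [np.zeros(3), np.zeros(3)]  # running sums for average strategies
variant_value = np.zeros(2)                # crude value estimate per variant
prev_expl = None

for t in range(1, 2001):
    # Stand-in for the PPO agent: epsilon-greedy choice of update variant.
    variant = rng.integers(2) if rng.random() < 0.1 else int(variant_value.argmax())

    strategies = [regret_matching(r) for r in regrets]
    for p in range(2):
        payoff = PAYOFF if p == 0 else -PAYOFF.T
        action_values = payoff.dot(strategies[1 - p])
        node_value = strategies[p].dot(action_values)
        regrets[p] += action_values - node_value
        if variant == 1:                    # regret matching+: clip regrets at zero
            regrets[p] = np.maximum(regrets[p], 0.0)
        strategy_sum[p] += strategies[p]

    avg = [s / t for s in strategy_sum]
    expl = exploitability(avg[0], avg[1])
    # Stepwise reward: how much this iteration reduced exploitability.
    reward = 0.0 if prev_expl is None else prev_expl - expl
    variant_value[variant] += 0.1 * (reward - variant_value[variant])
    prev_expl = expl

print(f"final exploitability: {expl:.4f}")
print(f"average strategies: {np.round(avg[0], 3)}, {np.round(avg[1], 3)}")
```

The average strategies converge toward the uniform Nash equilibrium and exploitability approaches zero; the point of the sketch is only the control loop in which a learned selector, rewarded stepwise by exploitability reduction, picks the update rule each iteration.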
