Abstract

In Markov games, playing against non-stationary opponents that are themselves learning remains challenging for reinforcement learning (RL) agents, because such opponents can evolve their policies concurrently. This increases the complexity of the learning task and slows the learning of RL agents. This paper proposes the efficient use of rough heuristics to speed up policy learning when playing against concurrent learners. Specifically, we propose an algorithm that efficiently learns explainable and generalized action-selection rules by exploiting representations of quantitative heuristics and an opponent model within an eXtended classifier system (XCS) in zero-sum Markov games. A neural network models the opponent from its observed behavior, and the inferred opponent policy is used for action selection and classifier evolution. Owing to the condition representation and matching mechanism of XCS, the quantitative heuristics and the opponent model can guide action selection in states with similar feature representations. In addition, when multiple heuristic policies are available, we introduce the concept of Pareto optimality to take all of them into account comprehensively during action selection. Moreover, we analyze the influence of the heuristic policies on the convergence of the algorithm. Furthermore, we introduce accuracy-based eligibility traces to further speed up classifier evolution: in the reinforcement component, classifiers that match historical traces are reinforced according to their accuracy. We demonstrate the advantages of the proposed algorithm over several benchmark algorithms in a soccer scenario and a thief-and-hunter scenario.
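
To make the Pareto-optimality step concrete, the following minimal Python sketch (illustrative only; the function name, score matrix, and numbers are assumptions, not code or data from the paper) shows how candidate actions could be filtered so that only actions that are not dominated under every heuristic policy remain for selection.

import numpy as np

def pareto_optimal_actions(heuristic_scores: np.ndarray) -> list:
    # heuristic_scores[h, a] is the preference that heuristic policy h
    # assigns to action a (higher is better).
    # An action is dominated if some other action is at least as good
    # under every heuristic and strictly better under at least one.
    n_actions = heuristic_scores.shape[1]
    non_dominated = []
    for a in range(n_actions):
        dominated = any(
            np.all(heuristic_scores[:, b] >= heuristic_scores[:, a])
            and np.any(heuristic_scores[:, b] > heuristic_scores[:, a])
            for b in range(n_actions) if b != a
        )
        if not dominated:
            non_dominated.append(a)
    return non_dominated

# Hypothetical example: two heuristic policies scoring four actions.
scores = np.array([
    [0.9, 0.2, 0.6, 0.6],   # e.g., a "move toward the goal" heuristic
    [0.1, 0.8, 0.6, 0.5],   # e.g., an "avoid the opponent" heuristic
])
print(pareto_optimal_actions(scores))  # [0, 1, 2]; action 3 is dominated by action 2

Restricting selection to such a non-dominated set lets every heuristic policy contribute to the decision without imposing a fixed priority ordering among them.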
