Abstract

This paper focuses on Multi-Agent Reinforcement Learning (MARL) in non-cooperative stochastic games, particularly addressing tasks whose completion is characterized by non-Markovian reward functions. We employ Reward Machines (RMs) to incorporate high-level task knowledge. First, we introduce Q-learning with Reward Machines for Stochastic Games (QRM-SG), where RMs are predefined and available to the agents. QRM-SG learns each agent's best-response policy at a Nash equilibrium by defining the Q-function on an augmented state space that integrates the state of the stochastic game with the RM states. At each time step, the Lemke-Howson method is used to compute the best-response policies for the stage game defined by the current Q-functions. Subsequently, we explore a more challenging setting where RMs are unavailable and propose Multi-Agent Reinforcement learning with Concurrent High-level knowledge inference (MARCH). MARCH uses automata learning to infer RMs iteratively and interleaves this process with QRM-SG to learn the best-response policies. RL episodes in which the obtained rewards are inconsistent with the rewards predicted by the current RMs trigger the inference of new RMs. We prove that QRM-SG and MARCH converge to the best-response policies under certain conditions. Experiments in two scenarios demonstrate the superior performance of QRM-SG and MARCH compared to baseline methods.
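
The sketch below is an illustrative, non-authoritative rendering of the QRM-SG update described above for the two-agent case; it is not the authors' implementation. All names (Q, stage_game_value, qrm_sg_update, N_ACTIONS) are hypothetical, and the Lemke-Howson step is delegated to the nashpy library rather than a hand-written solver. It only illustrates the idea of Q-functions defined on an augmented state (environment state plus each agent's RM state), with the stage game at that augmented state solved by Lemke-Howson.

```python
# Hedged sketch of a QRM-SG-style Q-update for two agents (assumed structure).
from collections import defaultdict
import numpy as np
import nashpy as nash

N_ACTIONS = 3          # hypothetical per-agent action count
GAMMA, ALPHA = 0.9, 0.1

# Q[i][(s, u1, u2)] is agent i's payoff matrix over joint actions (a1, a2),
# where s is the environment state and u1, u2 are the agents' RM states.
Q = [defaultdict(lambda: np.zeros((N_ACTIONS, N_ACTIONS))) for _ in range(2)]

def stage_game_value(aug_state):
    """Solve the stage game given by the current Q-functions at aug_state
    via Lemke-Howson; return the mixed strategies and equilibrium values."""
    A, B = Q[0][aug_state], Q[1][aug_state]
    sigma1, sigma2 = nash.Game(A, B).lemke_howson(initial_dropped_label=0)
    v1 = sigma1 @ A @ sigma2   # agent 1's expected payoff at equilibrium
    v2 = sigma1 @ B @ sigma2   # agent 2's expected payoff at equilibrium
    return (sigma1, sigma2), (v1, v2)

def qrm_sg_update(aug_state, joint_action, rewards, next_aug_state):
    """One Q-learning step in the augmented state space.

    rewards are produced by the agents' reward machines, and next_aug_state
    already contains the RM states reached after the labelled transition.
    """
    a1, a2 = joint_action
    _, (v1_next, v2_next) = stage_game_value(next_aug_state)
    for i, v_next in enumerate((v1_next, v2_next)):
        q = Q[i][aug_state][a1, a2]
        Q[i][aug_state][a1, a2] = (1 - ALPHA) * q + ALPHA * (rewards[i] + GAMMA * v_next)
```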
