Abstract

In this article, we investigate the Nash-seeking problem for a set of agents playing an infinite network aggregative Markov game. In particular, we focus on a noncooperative framework in which each agent selfishly aims to maximize its long-term average reward without explicit knowledge of the environment dynamics or of its own reward function. The main contribution of this article is a continuous multiagent reinforcement learning (MARL) algorithm for the Nash-seeking problem in infinite dynamic games with a convergence guarantee. To this end, we propose an actor-critic MARL algorithm based on expected policy gradient (EPG) with two general function approximators that estimate the value function and the Nash policy of the agents. We consider continuous state and action spaces and adopt a recently proposed EPG formulation to reduce the variance of the gradient approximation. Based on this formulation and under some conventional assumptions (e.g., linear function approximators), we prove that the policies of the agents converge to the unique Nash equilibrium (NE) of the game. Furthermore, an estimation error analysis is conducted to investigate the effect of the error arising from function approximation. As a case study, the framework is applied to a cloud radio access network (C-RAN) by modeling the remote radio heads (RRHs) as the agents and the congestion of the baseband units (BBUs) as the dynamics of the environment.
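To make the actor-critic EPG structure described above concrete, the following is a minimal single-agent sketch with linear function approximators and an average-reward criterion. All names and choices (the features `phi`/`psi`, the Gaussian policy with fixed `SIGMA`, the step sizes, the Gauss-Hermite quadrature over actions, and the toy `step_env` dynamics) are illustrative assumptions, not the paper's actual design; the multiagent network-aggregative coupling and the convergence analysis are omitted.

```python
# Minimal sketch (assumptions noted above): average-reward actor-critic where the
# actor update is an expected policy gradient, i.e., the policy gradient is
# integrated over the Gaussian action distribution by quadrature instead of
# being estimated from a single sampled action.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM = 3, 3                      # assumed toy dimensions
SIGMA = 0.5                                     # fixed policy std (assumption)
NODES, WEIGHTS = np.polynomial.hermite.hermgauss(16)   # quadrature for the EPG

def phi(s):                                     # assumed actor features
    return s

def psi(s, a):                                  # assumed critic features
    return np.concatenate([s, [a, a * a]])

theta = np.zeros(FEAT_DIM)                      # actor: mean action = theta . phi(s)
w = np.zeros(STATE_DIM + 2)                     # critic: Q_hat(s, a) = w . psi(s, a)
rho = 0.0                                       # running average-reward estimate

def q_hat(s, a):
    return w @ psi(s, a)

def expected_policy_gradient(s):
    """Integrate grad_theta log pi * Q_hat over the Gaussian action law."""
    mu = theta @ phi(s)
    acts = mu + np.sqrt(2.0) * SIGMA * NODES    # change of variables for Gauss-Hermite
    score = (acts - mu) / SIGMA**2              # d log pi / d mu for a Gaussian policy
    q_vals = np.array([q_hat(s, a) for a in acts])
    inner = (WEIGHTS * score * q_vals).sum() / np.sqrt(np.pi)
    return inner * phi(s)

def step_env(s, a):
    """Hypothetical stand-in for the unknown environment dynamics and reward."""
    s_next = np.clip(0.9 * s + 0.1 * a + 0.05 * rng.standard_normal(STATE_DIM), -5, 5)
    reward = -np.sum(s_next**2) - 0.01 * a**2
    return s_next, reward

alpha_w, alpha_theta, alpha_rho = 0.05, 0.01, 0.01
s = rng.standard_normal(STATE_DIM)
for t in range(5000):
    a = theta @ phi(s) + SIGMA * rng.standard_normal()      # behavior action
    s_next, r = step_env(s, a)
    a_next = theta @ phi(s_next)                            # mean action for bootstrap
    delta = r - rho + q_hat(s_next, a_next) - q_hat(s, a)   # average-reward TD error
    rho += alpha_rho * delta
    w += alpha_w * delta * psi(s, a)                        # critic: semi-gradient TD(0)
    theta += alpha_theta * expected_policy_gradient(s)      # actor: EPG ascent step
    s = s_next
```

In this sketch, the variance reduction attributed to EPG comes from replacing the single-sample score-function estimator with a quadrature over the policy's action distribution; the multiagent version would additionally condition each agent's critic on the aggregate behavior of its neighbors.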
