Abstract

In this paper, we develop a data-driven algorithm, based on off-policy reinforcement learning (RL), to learn the Nash equilibrium solution of a two-player non-zero-sum (NZS) game with completely unknown linear discrete-time dynamics. The algorithm solves the coupled algebraic Riccati equations (CARE) forward in time in a model-free manner, using online measured data. We first derive the CARE for the two-player NZS game. Then, a model-free off-policy RL method is developed to obviate the requirement of complete knowledge of the system dynamics. In addition, the on- and off-policy RL algorithms are compared in terms of robustness against probing noise. Finally, a simulation example is presented to show the efficacy of the presented approach.
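For reference, one standard form of the CARE for a two-player linear-quadratic NZS game is sketched below; the symbols (state-feedback gains K and L, weights Q_i and R_ij) are illustrative assumptions and may not match the paper's equations (16) and (17) exactly:

$$\begin{aligned} P_1 &= Q_1 + K^{T} R_{11} K + L^{T} R_{12} L + (A + B_1 K + B_2 L)^{T} P_1 (A + B_1 K + B_2 L),\\ P_2 &= Q_2 + K^{T} R_{21} K + L^{T} R_{22} L + (A + B_1 K + B_2 L)^{T} P_2 (A + B_1 K + B_2 L), \end{aligned}$$

with Nash gains

$$K = -\bigl(R_{11} + B_1^{T} P_1 B_1\bigr)^{-1} B_1^{T} P_1 (A + B_2 L), \qquad L = -\bigl(R_{22} + B_2^{T} P_2 B_2\bigr)^{-1} B_2^{T} P_2 (A + B_1 K).$$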

Highlights

  • Game theory is widely used in complex decision-making problems where the collective behavior depends on the compilation of local interactions [1], [2]

  • In this paper, we develop on- and off-policy variants of a reinforcement learning algorithm to learn online the Nash equilibrium solution of the two-player NZS game with linear discrete-time (DT) dynamics

  • The off-policy variant is robust to probing noise, i.e., no bias arises from adding a probing noise to the control input to satisfy the persistence-of-excitation condition (a sketch illustrating this follows the list)
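As a minimal single-player illustration of the probing-noise point (a simplification, not the paper's two-player algorithm; the matrices, gain, and noise level below are made up), the sketch evaluates a fixed gain K from data generated under u_k = K x_k + e_k. The off-policy regressor carries explicit correction terms in e_k, so the recovered P matches the exact Lyapunov solution, whereas a naive evaluation that ignores e_k is biased:

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    rng = np.random.default_rng(0)

    # Made-up stable system and target gain (for illustration only).
    A = np.array([[0.9, 0.2], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    K = np.array([[-0.1, -0.5]])          # target policy u = K x to be evaluated

    def svec(x):
        # coordinates of x x^T against a symmetric basis: [x1^2, 2 x1 x2, x2^2]
        return np.array([x[0]**2, 2*x[0]*x[1], x[1]**2])

    # Behavior policy: u = K x + e with probing noise e.
    N, x = 200, np.array([1.0, -1.0])
    rows_off, rows_on, rhs = [], [], []
    for _ in range(N):
        e = 0.5 * rng.standard_normal(1)
        u = K @ x + e
        x_next = A @ x + B @ u
        # Off-policy unknowns: vech(P), M1 = B^T P (A+BK), M2 = B^T P B.
        rows_off.append(np.concatenate([svec(x) - svec(x_next),
                                        2.0 * e[0] * x,        # 2 e^T M1 x column
                                        [e[0]**2]]))           # e^T M2 e column
        rows_on.append(svec(x) - svec(x_next))                 # naive regressor, noise ignored
        rhs.append(x @ (Q + K.T @ R @ K) @ x)
        x = x_next

    theta_off, *_ = np.linalg.lstsq(np.array(rows_off), np.array(rhs), rcond=None)
    theta_on,  *_ = np.linalg.lstsq(np.array(rows_on),  np.array(rhs), rcond=None)

    unvech = lambda p: np.array([[p[0], p[1]], [p[1], p[2]]])
    P_true = solve_discrete_lyapunov((A + B @ K).T, Q + K.T @ R @ K)
    print("off-policy error:", np.linalg.norm(unvech(theta_off[:3]) - P_true))  # ~ 0
    print("on-policy  error:", np.linalg.norm(unvech(theta_on[:3])  - P_true))  # biased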

Summary

INTRODUCTION

Game theory is widely used in complex decision-making problems where the collective behavior depends on the compilation of local interactions [1], [2]. A novel model-free algorithm is developed for discrete-time systems to solve the NZS game, obviating the requirement of complete knowledge of the system dynamics.

MODEL-BASED ADAPTIVE DYNAMIC PROGRAMMING

In Section II, an off-line algorithm is developed to solve the CARE (16) and (17), which are necessary and sufficient conditions for the Nash equilibrium. It is shown that one can approximate the solution to the CARE (16) and (17) by iteratively solving the off-policy Bellman equations (48) and (49). The off-policy Bellman equation (49) can be rewritten as

$$x_k^T P_2^i x_k = x_k^T Q_1 x_k + x_k^T (K^i)^T R_{21} K^i x_k + x_k^T (L^i)^T R_{22} L^i x_k + x_k^T (A + B_1 K^i + B_2 L^i)^T P_2^i (A + B_1 K^i + B_2 L^i) x_k.$$

As discussed in [52], the stopping criterion with …
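A model-based sketch of such an iterative solver follows (illustrative: the system, weights, and the pairing of Q_2 with R_21 and R_22 are assumptions, and convergence of this best-response iteration is not guaranteed in general). Policy evaluation solves the two Lyapunov-type Bellman equations for the current gains; policy improvement lets each player best-respond; the loop stops when successive gains agree:

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    # Made-up two-player system and weights (for illustration only).
    A  = np.array([[0.9, 0.2], [0.0, 0.8]])
    B1 = np.array([[0.0], [1.0]])
    B2 = np.array([[1.0], [0.0]])
    Q1, Q2 = np.eye(2), 2 * np.eye(2)
    R11, R12, R21, R22 = np.eye(1), np.eye(1), np.eye(1), np.eye(1)

    K, L = np.zeros((1, 2)), np.zeros((1, 2))   # initial gains; A itself is stable here
    for i in range(200):
        Ac = A + B1 @ K + B2 @ L
        # Policy evaluation: Lyapunov (Bellman) equations for the current gains.
        P1 = solve_discrete_lyapunov(Ac.T, Q1 + K.T @ R11 @ K + L.T @ R12 @ L)
        P2 = solve_discrete_lyapunov(Ac.T, Q2 + K.T @ R21 @ K + L.T @ R22 @ L)
        # Policy improvement: each player best-responds to the other's gain.
        K_new = -np.linalg.solve(R11 + B1.T @ P1 @ B1, B1.T @ P1 @ (A + B2 @ L))
        L_new = -np.linalg.solve(R22 + B2.T @ P2 @ B2, B2.T @ P2 @ (A + B1 @ K))
        done = max(np.linalg.norm(K_new - K), np.linalg.norm(L_new - L)) < 1e-10
        K, L = K_new, L_new
        if done:                                # stopping criterion on successive gains
            break

    print("P1 =\n", P1, "\nP2 =\n", P2)         # approximate CARE solution at convergence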

Data Collection Phase
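One plausible reading of this phase, sketched below (the function name, signature, and noise model are hypothetical): run the system once under fixed behavior gains plus probing noise and log the (x_k, u_k, w_k, x_{k+1}) tuples that the off-policy learner then reuses; the dynamics matrices appear only to generate the data, never inside the learner:

    import numpy as np

    rng = np.random.default_rng(1)

    def collect_data(A, B1, B2, K0, L0, x0, N, noise_std=0.1):
        # Behavior policies u = K0 x + noise, w = L0 x + noise; the probing
        # noise keeps the logged data persistently exciting.
        x, data = np.asarray(x0, dtype=float), []
        for _ in range(N):
            u = K0 @ x + noise_std * rng.standard_normal(K0.shape[0])
            w = L0 @ x + noise_std * rng.standard_normal(L0.shape[0])
            x_next = A @ x + B1 @ u + B2 @ w
            data.append((x.copy(), u, w, x_next))
            x = x_next
        return data

    # Example with made-up dynamics (used only to simulate the plant):
    A  = np.array([[0.9, 0.2], [0.0, 0.8]])
    B1 = np.array([[0.0], [1.0]])
    B2 = np.array([[1.0], [0.0]])
    data = collect_data(A, B1, B2, np.zeros((1, 2)), np.zeros((1, 2)), [1.0, -1.0], N=200)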
SIMULATION
CASE 1
CASE 2
CASE 3
CONCLUSION