Base on $Q$-Learning Pareto Optimality for Linear Itô Stochastic Systems With Markovian Jumps

Zhongyang Ming, Huaguang Zhang, Yanhong Luo, Weihua Li

doi:10.1109/tase.2023.3234928

Abstract

This article investigates the cooperative differential game (CDG) for continuous-time linear Itô stochastic systems with Markovian jumps (SSMJ) to obtain Pareto solutions. Unlike most existing works, which study nonzero-sum games, this article studies the infinite-horizon quadratic CDG for Itô-type SSMJ with an unknown system matrix and unknown transition probability. A novel online $Q$-learning algorithm is developed, in which (i) the optimal control problem is shown to be equivalent to solving a stochastic algebraic Riccati equation (ARE), and (ii) the joint cost function is approximated by a critic neural network (NN) and the Pareto-efficient strategies are approximated by two actor NNs. A rigorous stability analysis shows that the SSMJ system state and the NN weight errors are uniformly ultimately bounded (UUB). Finally, the theoretical analysis is validated by a numerical example with detailed discussion.

Note to Practitioners: In practical applications, many systems are affected by changes in the external environment or failures of internal components, which cause random jumps in the system parameters. Markovian jump systems can describe such behavior effectively. Moreover, when the system is disturbed by its internal parameters, the control input, the external environment, random state-measurement errors, and other random factors, a deterministic model can no longer describe the controlled system accurately. The SSMJ can therefore describe practical problems more accurately. Under cooperation, the cost that one specific player incurs is, in general, no longer uniquely determined. If all players decide, for example, to use their control variables to reduce the cost of player 1 as much as possible, player 1 attains a different minimum than in the case where all players agree collectively to help a different player minimize his cost. So, depending on how the players choose to 'divide' their control efforts, a player incurs different 'minima'. Therefore, we design an online learning algorithm to obtain Pareto solutions for different weights. On the other hand, in practice it is difficult to obtain an accurate system model. To solve this problem, a novel scheme is designed using the $Q$-learning technique, which does not need the system matrix.
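The dependence of each player's cost on the chosen weights can be illustrated with a small model-based sketch. It is not the article's $Q$-learning algorithm: it assumes a scalar two-player system with known coefficients, no Markovian jumps, and noise that multiplies the state only, dx = (a x + b1 u1 + b2 u2) dt + c x dw. For a weight alpha on player 1, the joint cost alpha*J1 + (1 - alpha)*J2 is minimized by solving the scalar stochastic ARE (2a + c^2) p + q - s p^2 = 0 in closed form. All names and numerical values below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical scalar example (assumed values, not taken from the article).
# r[i][j] is the weight player i places on control u_j in its own cost.
PARAMS = dict(a=0.5, b=(1.0, 1.0), c=0.3,
              q=(1.0, 3.0), r=((1.0, 2.0), (2.0, 1.0)))


def weighted_pareto_gain(alpha, a, b, c, q, r):
    """Minimize alpha*J1 + (1-alpha)*J2 for the scalar system
    dx = (a x + b1 u1 + b2 u2) dt + c x dw (no Markovian jumps, known model).
    Returns the ARE solution p and the feedback gains k, with u_j = -k_j x."""
    q_a = alpha * q[0] + (1 - alpha) * q[1]
    r_a = [alpha * r[0][j] + (1 - alpha) * r[1][j] for j in range(2)]
    s = sum(b[j] ** 2 / r_a[j] for j in range(2))
    m = 2 * a + c ** 2
    # Positive root of the scalar stochastic ARE: m*p + q_a - s*p**2 = 0.
    p = (m + math.sqrt(m * m + 4 * s * q_a)) / (2 * s)
    k = [b[j] * p / r_a[j] for j in range(2)]
    return p, k


def player_costs(k, a, b, c, q, r):
    """Cost of each player per unit x0**2 under u_j = -k_j x,
    obtained from the scalar Lyapunov equation of the closed loop."""
    a_cl = a - sum(b[j] * k[j] for j in range(2))
    decay = -(2 * a_cl + c ** 2)  # must be positive for mean-square stability
    if decay <= 0:
        raise ValueError("closed loop is not mean-square stable")
    return [(q[i] + sum(r[i][j] * k[j] ** 2 for j in range(2))) / decay
            for i in range(2)]


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 0.8):
        p, k = weighted_pareto_gain(alpha, **PARAMS)
        j1, j2 = player_costs(k, **PARAMS)
        print(f"alpha={alpha:.1f}  J1={j1:.4f}  J2={j2:.4f}  joint={p:.4f}")
```

Sweeping alpha traces out different points on the Pareto frontier: a larger weight on player 1 lowers player 1's cost and raises player 2's. The article's setting additionally involves Markovian jumps and unknown dynamics, which this sketch leaves out.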

Full Text