Abstract

Informally, a learning automaton [2] can be viewed as an object that can choose from a finite number of actions. For every action that it chooses, the random environment in which it operates evaluates that action, and corresponding feedback is sent to the automaton, based on which the next action is chosen. As this process progresses, the automaton asymptotically learns to choose the optimal action for that unknown environment. The stochastic iterative algorithm used by the automaton to select its successive actions based on the environment's response is called the stochastic learning algorithm. An important property of the learning automaton is its ability to improve its performance with time while operating in an unknown environment. In this chapter, for the sake of consistency, our notation follows or parallels that of standard books on game theory (e.g., [3]) and stochastic learning [4].

In multiple automata games, instead of one automaton (player) playing against the environment, $N$ automata, say $A_1, A_2, \ldots, A_N$, take part in a game. Consider a typical automaton $A_i$ described by a 4-tuple $\{S_i, r_i, T_i, p_i\}$. Each player $i$ has a finite set of actions or pure strategies, $S_i$, $1 \le i \le N$. Let the cardinality of $S_i$ be $m_i$, $1 \le i \le N$. The result of each play is a random payoff to each player. Let $r_i$ denote the random payoff to player $i$, $1 \le i \le N$; it is assumed here that $r_i \in [0, 1]$. Define functions $d_i : \prod_{j=1}^{N} S_j \to [0, 1]$, $1 \le i \le N$, by

$$d_i(a_1, \ldots, a_N) = E\left[\, r_i \mid \text{player } j \text{ chose action } a_j,\; a_j \in S_j,\; 1 \le j \le N \,\right]. \tag{0.1}$$

The function $d_i$ is called the expected payoff function or utility function of player $i$, $1 \le i \le N$. The objective of each player is to maximize its expected payoff. Players choose their strategies based on a time-varying probability distribution. Let $p_i(k) = [p_{i1}(k), \ldots, p_{im_i}(k)]^T$ denote the action choice probability distribution of the $i$th automaton at time instant $k$. Then $p_{il}(k)$ denotes the probability with which the $i$th automaton player chooses the $l$th pure strategy at instant $k$. Thus $p_i(k)$ is the strategy probability vector employed by the $i$th player at instant $k$. $T_i$ denotes the stochastic learning algorithm according to which the elements of $p_i$ are updated at each time $k$, i.e., $p_i(k+1) = T_i(p_i(k), a_i(k), r_i(k))$, where $a_i(k)$ is the action chosen and $r_i(k)$ the payoff received by player $i$ at instant $k$.
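As a concrete illustration of the update map $T_i$, the sketch below implements one automaton with the classical linear reward-inaction ($L_{R-I}$) scheme and runs two such automata against each other. The choice of $L_{R-I}$, the step size `b`, and the toy payoff matrices are illustrative assumptions, not specifics taken from the text above.

```python
import numpy as np

class LearningAutomaton:
    """One player A_i: keeps p_i(k) over m_i pure strategies and
    updates it from the random payoff r_i(k) in [0, 1]."""

    def __init__(self, num_actions, step_size=0.05, rng=None):
        self.p = np.full(num_actions, 1.0 / num_actions)  # p_i(0): uniform
        self.b = step_size
        self.rng = rng if rng is not None else np.random.default_rng()

    def choose_action(self):
        # Sample the l-th pure strategy with probability p_il(k).
        return self.rng.choice(len(self.p), p=self.p)

    def update(self, action, payoff):
        # Linear reward-inaction (L_R-I), one common choice of T_i:
        # p_i(k+1) = p_i(k) + b * r_i(k) * (e_action - p_i(k)).
        # The components of p still sum to 1, so it stays a
        # probability vector.
        e = np.zeros_like(self.p)
        e[action] = 1.0
        self.p += self.b * payoff * (e - self.p)

# Hypothetical two-player game with Bernoulli payoffs: entry
# D[i][a1][a2] is d_i(a1, a2), the expected payoff to player i.
# These numbers make (0, 0) the unique pure Nash equilibrium.
D = np.array([[[0.9, 0.6], [0.3, 0.4]],   # d_1
              [[0.8, 0.1], [0.2, 0.5]]])  # d_2

rng = np.random.default_rng(0)
players = [LearningAutomaton(2, rng=rng) for _ in range(2)]
for k in range(5000):
    a = [pl.choose_action() for pl in players]
    for i, pl in enumerate(players):
        r = float(rng.random() < D[i][a[0]][a[1]])  # r_i(k) in {0, 1}
        pl.update(a[i], r)

print([pl.p.round(3) for pl in players])  # mass concentrates near (0, 0)
```

Note that each automaton updates only its own $p_i$ from its own payoff, with no access to the other players' strategies or payoffs; this decentralization is what makes such schemes attractive for games against an unknown environment.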

