Abstract

The authors recently presented a sampling-based algorithm for stochastic optimization based on Auer et al.'s Exp3 algorithm for adversarial multi-armed bandit problems. In particular, they recursively extended the Exp3-based algorithm to solve finite-horizon Markov decision processes (MDPs) and analyzed its finite-iteration performance in terms of the expected bias relative to the maximum value of the recursive sample-average-approximation (SAA) problem induced by the sampling process in the algorithm. They showed that the upper bound on the expected bias approaches zero as the sampling size per sampled state in each stage goes to infinity, so that the algorithm converges to the optimal value of the original MDP problem in the limit. As a sequel to that work, the idea is extended further to solving two-person zero-sum Markov games (MGs), providing a finite-iteration bound on the error relative to the equilibrium value of the induced recursive SAA game problem and asymptotic convergence to the true equilibrium value. The recursively extended algorithm for MGs can be used to break the curse of dimensionality.
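For readers unfamiliar with the building block, the following is a minimal sketch of Auer et al.'s Exp3 update for the adversarial bandit setting: exponential weights mixed with uniform exploration, using an importance-weighted reward estimate so the update remains unbiased. This is a generic illustration of Exp3 itself, not the authors' recursive extension to MDPs or MGs; the function names and the exploration rate `gamma` are illustrative choices.

```python
import math
import random

def exp3(num_arms, reward_fn, num_rounds, gamma=0.1):
    """Exp3 for adversarial multi-armed bandits (rewards assumed in [0, 1]).

    Maintains exponential weights over arms, mixes in uniform
    exploration at rate gamma, and updates only the pulled arm
    with an importance-weighted reward estimate.
    """
    weights = [1.0] * num_arms
    for _ in range(num_rounds):
        total = sum(weights)
        # Mixed distribution: (1 - gamma) * weights + gamma * uniform.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm)
        # Dividing by the pull probability keeps the estimate unbiased.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
    total = sum(weights)
    return [w / total for w in weights]
```

In the recursive extension described above, an Exp3 instance of this kind is run per sampled state per stage, with the reward of an action estimated by recursive sampling of the next stage.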
