Abstract

In this paper we propose two novel distributed algorithms for iterative multi-agent off-policy linear value function approximation in Markov decision processes. The algorithms require no fusion center; they are obtained by incorporating consensus-based collaboration between the agents, over time-varying communication networks, into recently proposed single-agent algorithms. The resulting distributed algorithms allow the agents to follow different behavior policies while evaluating the response to a single target policy, using a common linear parametrization of the value function. Under appropriate assumptions on the time-varying network topology and on the agents' overall state-visit distributions, we prove, for both algorithms, weak convergence of the parameter estimates to a consensus point determined by an associated ODE. By a proper design of the network parameters and/or topology, this point can be tuned to coincide with the globally optimal point. The properties and effectiveness of the proposed algorithms are illustrated on an example.
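To make the overall structure concrete, the following minimal Python sketch illustrates one synchronous iteration of a generic consensus-based distributed off-policy TD(0) scheme with linear features: a consensus mixing step over a (possibly time-varying) weight matrix followed by a local importance-sampling-corrected update at each agent. The function names, the specific TD(0) update rule, and the fixed mixing matrix are illustrative assumptions, not the exact algorithms analyzed in the paper.

```python
import numpy as np

def consensus_step(thetas, W):
    """Mix the agents' parameter vectors with a row-stochastic weight matrix:
    theta_i <- sum_j W[i, j] * theta_j. In the time-varying setting W would
    change at every iteration according to the current network topology."""
    return W @ thetas  # thetas has shape (num_agents, num_features)

def local_off_policy_td(theta, phi_s, phi_s_next, reward, rho, alpha, gamma):
    """One importance-sampling-corrected TD(0) update for a single agent.
    rho = pi(a|s) / mu_i(a|s) re-weights a sample drawn under the agent's
    behavior policy mu_i toward the common target policy pi."""
    td_error = reward + gamma * phi_s_next @ theta - phi_s @ theta
    return theta + alpha * rho * td_error * phi_s

# Toy usage: 3 agents, 4 features, one iteration on synthetic transition data.
rng = np.random.default_rng(0)
num_agents, num_features = 3, 4
thetas = rng.normal(size=(num_agents, num_features))

# Doubly stochastic mixing weights for a fixed 3-agent network (assumed here).
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

thetas = consensus_step(thetas, W)
for i in range(num_agents):
    phi_s = rng.normal(size=num_features)       # feature vector of current state
    phi_s_next = rng.normal(size=num_features)  # feature vector of next state
    reward, rho = rng.normal(), 1.2             # rho: importance-sampling ratio
    thetas[i] = local_off_policy_td(thetas[i], phi_s, phi_s_next,
                                    reward, rho, alpha=0.05, gamma=0.95)
```

Under assumptions of the kind stated in the abstract (connectivity of the time-varying network and suitable state-visit distributions), iterating such consensus-plus-local-update steps drives the agents' parameter vectors toward a common limit point; the paper characterizes this limit via an associated ODE.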

