Abstract
We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., starting from some base policy and generating an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program. In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function and shared, perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby, at every stage, the agents sequentially (one at a time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property without any on-line coordination of control selection between the agents. For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and approximate forms, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
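To illustrate the linear-versus-exponential comparison made in the abstract, the following is a minimal sketch (not the paper's implementation) of standard rollout versus one-agent-at-a-time multiagent rollout for a deterministic finite-horizon problem. The model interface used here, consisting of the hypothetical callables step(state, joint_control) -> (next_state, stage_cost) and base_policy(state) -> joint_control, the per-agent control sets control_sets, and the horizon length, is assumed for illustration only.

import itertools


def rollout_cost(state, first_control, step, base_policy, horizon):
    """Apply `first_control` at the current stage, then follow the base
    policy for the remaining stages, and return the accumulated cost."""
    state, cost = step(state, first_control)
    for _ in range(horizon - 1):
        state, stage_cost = step(state, base_policy(state))
        cost += stage_cost
    return cost


def standard_rollout_control(state, step, base_policy, control_sets, horizon):
    """Standard rollout: search the *joint* control space, whose size is the
    product of the agents' control-set sizes (exponential in the number of agents)."""
    return min(itertools.product(*control_sets),
               key=lambda u: rollout_cost(state, u, step, base_policy, horizon))


def multiagent_rollout_control(state, step, base_policy, control_sets, horizon):
    """One-agent-at-a-time rollout: agent i optimizes only its own control
    component, with the components of agents 1..i-1 fixed to their
    already-computed rollout choices and the components of agents i+1..m
    fixed to the base policy.  The number of candidates evaluated per stage
    is the *sum* of the control-set sizes (linear in the number of agents)."""
    chosen = list(base_policy(state))          # start from the base policy's joint control
    for i, U_i in enumerate(control_sets):
        best = min(U_i, key=lambda u_i: rollout_cost(
            state,
            tuple(chosen[:i]) + (u_i,) + tuple(chosen[i + 1:]),
            step, base_policy, horizon))
        chosen[i] = best                       # freeze agent i's component
    return tuple(chosen)

In this sketch both variants reuse the same base-policy simulation; the one-at-a-time variant evaluates the sum of the agents' control-set sizes per stage rather than their product, which is the computational saving claimed above, and, per the paper, it still retains the cost improvement guarantee of standard rollout.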