Abstract

This paper describes a new method for dynamic programming (DP) based multiagent reinforcement learning in the Markov decision process (MDP) model. In multiagent settings, it is difficult for agents to learn cooperative actions properly because they may change their policies at the same time. To solve this problem, each agent should perform its policy improvement at a different time. We therefore propose a multiple-timescales policy improvement method. We present comparative experiments between multiple-timescales policy improvement and exclusive policy improvement. In these experiments, our method reduced the search cost of finding the optimal common-payoff Nash solution.
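
The difficulty caused by agents changing policies at the same time can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: the 2x2 common-payoff matrix game, the greedy best-response update, and the names PAYOFF, simultaneous_updates, and one_at_a_time_updates are all assumptions chosen for illustration. Two agents that both switch to a best response in the same step can oscillate between miscoordinated actions, whereas updating one agent per step reaches a coordinated joint action.

```python
import numpy as np

# Hypothetical common-payoff game: both agents receive PAYOFF[a1, a2].
# The agents are rewarded only when they choose the same action.
PAYOFF = np.array([[1.0, 0.0],
                   [0.0, 1.0]])


def best_response_1(a2):
    """Agent 1's greedy action given agent 2's current action."""
    return int(np.argmax(PAYOFF[:, a2]))


def best_response_2(a1):
    """Agent 2's greedy action given agent 1's current action."""
    return int(np.argmax(PAYOFF[a1, :]))


def simultaneous_updates(a1, a2, steps):
    """Both agents improve in the same step, each against the other's old action."""
    history = [(a1, a2)]
    for _ in range(steps):
        a1, a2 = best_response_1(a2), best_response_2(a1)
        history.append((a1, a2))
    return history


def one_at_a_time_updates(a1, a2, steps):
    """Only one agent improves per step; the agents take turns."""
    history = [(a1, a2)]
    for t in range(steps):
        if t % 2 == 0:
            a1 = best_response_1(a2)
        else:
            a2 = best_response_2(a1)
        history.append((a1, a2))
    return history


if __name__ == "__main__":
    # Start from a miscoordinated joint action (0, 1).
    print(simultaneous_updates(0, 1, 4))   # keeps flipping between (0, 1) and (1, 0)
    print(one_at_a_time_updates(0, 1, 4))  # settles at (1, 1) after the first update
```

In this toy game the simultaneous schedule never leaves the zero-payoff joint actions, while the turn-taking schedule reaches a payoff-1 joint action after a single update. How the paper assigns timescales to agents is not specified in the abstract and is not modeled here.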
