Abstract

Learning action policies for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose to model this problem in a contextual bandit setting with delayed reward signals, in particular an individual short-term reward signal and a shared long-term reward signal. Our algorithm models these delayed reward signals directly through reward oracles and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with only implicit reward information. A demonstration, implemented with deep learning regressors, shows the effectiveness of the proposed algorithm in learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment, outperforming a baseline policy against a well-known adversary benchmark team.
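
To make the high-level description above concrete, here is a minimal sketch of oracle-based action selection in a contextual bandit with expert sampling guidance. All identifiers (short_term_oracle, long_term_oracle, expert_policy, beta) are illustrative assumptions, not names from the paper, and the mixing rule is one plausible reading of "sampling guidance", not the authors' exact procedure.

```python
# Hedged sketch: contextual bandit action selection driven by two learned
# reward oracles, with occasional deference to an expert-designed policy.
import random

def select_action(context, actions, short_term_oracle, long_term_oracle,
                  expert_policy, beta=0.1):
    """Pick an action by combining oracle reward estimates, sometimes
    following an expert policy to guide exploration (assumed scheme)."""
    if random.random() < beta:
        # With probability beta, let the expert-designed policy choose.
        return expert_policy(context)

    # Otherwise act greedily on the combined oracle estimates:
    # individual short-term reward plus shared long-term reward.
    def estimated_reward(action):
        return (short_term_oracle(context, action)
                + long_term_oracle(context, action))

    return max(actions, key=estimated_reward)
```

Keeping the two oracles separate, rather than learning one blended value, mirrors the paper's split between the individual short-term and shared long-term signals; how the two estimates are actually combined is an assumption here.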

Highlights

  • Learning optimal policies for agents in decentralized multi-agent environments has attracted significant attention from scholars

  • Unlike previous work [46], in which a delayed reward signal was used together with the Double Deep Q-learning algorithm to address these challenges, we explore another approach: realizability-based contextual bandit learning, assuming the availability of regression oracles that generate delayed reward signals suitable for decentralized multi-agent learning (see the sketch after this list)

  • The general performance of the team, expressed as the difference between average goals scored and average goals conceded (Figure 1, bottom), improves in all cases. This empirically confirms the effectiveness of embedding inferences of the delayed individual short-term and shared long-term rewards when selecting between actions under a contextual bandit learning (CBL) setting
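
As referenced in the second highlight, the following is a hedged sketch of how a regression oracle might be fit to delayed reward signals. The buffering scheme and the scikit-learn MLPRegressor stand in for the paper's deep learning regressors and are assumptions for illustration only.

```python
# Hedged sketch: turning delayed reward signals into supervised targets
# for a regression oracle over context-action features.
import numpy as np
from sklearn.neural_network import MLPRegressor

class DelayedRewardOracle:
    def __init__(self):
        self.features, self.targets = [], []
        # A small feed-forward regressor as a stand-in for the paper's
        # deep learning regressors (architecture is an assumption).
        self.model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)

    def record(self, context, action, delayed_reward):
        # Once the delayed signal arrives (e.g., an end-of-episode
        # outcome), attach it to the stored (context, action) pair.
        self.features.append(np.concatenate([context, action]))
        self.targets.append(delayed_reward)

    def fit(self):
        # Regress delayed rewards on context-action features.
        self.model.fit(np.asarray(self.features), np.asarray(self.targets))

    def predict(self, context, action):
        x = np.concatenate([context, action]).reshape(1, -1)
        return float(self.model.predict(x)[0])
```

Under this reading, one such oracle would be trained per signal (individual short-term and shared long-term), and their predictions plugged into the action-selection sketch shown after the abstract.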


Introduction

Learning optimal policies for agents in decentralized multi-agent environments has attracted significant attention from scholars. Depending on the nature of the environment, multi-agent problems can be modeled with consideration of various criteria, such as centralized or decentralized control, complete or partial observation of the environment, competition or cooperation among agents [49], and heterogeneity of agent types [39, 73]. One common approach treats each agent as an independent learner and other agents as part of the environment [16]. This approach was realized under the assumption that information can be shared
