Abstract

The classical multi-armed bandit (MAB) framework studies the exploration-exploitation dilemma in sequential decision making and treats the arm with the highest expected reward as the optimal choice. However, in some applications, an arm with a high expected reward can be risky to play if its variance is high. Hence, the variation of the reward should be considered to make the arm-selection process risk-aware. In this letter, the mean-variance metric is investigated to measure the uncertainty of the received rewards. We first study a risk-aware MAB problem in which the reward follows a Gaussian distribution, and develop a concentration inequality on the variance to design a Gaussian risk-aware upper confidence bound (GRA-UCB) algorithm. We then extend this algorithm to a novel asymptotic risk-aware upper confidence bound (ARA-UCB) algorithm by deriving an upper confidence bound on the variance from the asymptotic distribution of the sample variance. Theoretical analysis proves that both proposed algorithms achieve $\mathcal{O}(\log(T))$ regret. Finally, numerical results demonstrate that our algorithms outperform several risk-aware MAB algorithms.
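To make the arm-selection rule concrete, below is a minimal sketch of a mean-variance UCB-style index in Python. It assumes the common MV form $\mathrm{MV}_i = \sigma_i^2 - \rho\,\mu_i$ (smaller is better) and a generic confidence width; the function names, the default $\rho$, and the widths are illustrative assumptions, not the paper's exact GRA-UCB or ARA-UCB bounds.

```python
import numpy as np

def mv_index(rewards, t, rho=1.0):
    """Optimistic (lower-confidence) estimate of an arm's mean-variance
    MV = sigma^2 - rho * mu, which the learner wants to minimize.

    rewards : list of observed rewards for this arm
    t       : current round (enters the confidence width)
    rho     : risk-tolerance weight on the mean (hypothetical default)

    The O(sqrt(log t / n)) widths below are generic illustrations, not
    the paper's exact concentration bounds on the mean and variance.
    """
    n = len(rewards)
    if n < 2:
        return -np.inf          # force an initial pull of every arm
    mu_hat = np.mean(rewards)
    var_hat = np.var(rewards, ddof=1)
    width = np.sqrt(2.0 * np.log(t + 1) / n)
    # Optimism for a minimization objective: deflate the variance
    # estimate and inflate the mean estimate by the confidence width.
    return (var_hat - width) - rho * (mu_hat + width)

def select_arm(history, t, rho=1.0):
    """Play the arm whose optimistic MV index is smallest."""
    return int(np.argmin([mv_index(r, t, rho) for r in history]))

# Example round: three arms, with the rewards observed so far.
history = [[1.0, 0.9, 1.1], [2.0, 0.1, 3.9], [0.5]]
print(select_arm(history, t=7))
```

The design point mirrored from the abstract is that optimism must cover both the mean and the variance estimates; the variance side is precisely where the concentration inequality (GRA-UCB) or the asymptotic sample-variance bound (ARA-UCB) enters.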

Highlights

  • The multi-armed bandit (MAB) models the online sequential decision-making problem for various applications including financial portfolio design, online recommendations, and crowdsourcing systems

  • In a standard MAB, a random reward of an arm can be observed once played by an agent, and the objective is to maximize the cumulative reward over a given number of plays (equivalently, to minimize the regret sketched after this list)

  • We study the risk-aware MAB problem based on the mean-variance (MV) paradigm and propose two novel risk-aware MAB algorithms
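For reference, the risk-neutral objective in the second highlight corresponds to minimizing the expected regret; the definition below is the textbook one, not quoted from the paper:

```latex
\mathcal{R}(T) \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{i_t}\right],
\qquad \mu^{*} = \max_i \mu_i ,
```

where $i_t$ is the arm played in round $t$. The risk-aware setting studied here replaces the mean $\mu_i$ with the MV metric defined in the Introduction below, and the abstract's $\mathcal{O}(\log(T))$ guarantee refers to that risk-aware regret.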


Summary

INTRODUCTION

The multi-armed bandit (MAB) models the online sequential decision-making problem for various applications including financial portfolio design, online recommendations, and crowdsourcing systems. In clinical trials, for instance, instead of choosing a treatment that achieves the best average therapeutic result but may occasionally lead to unacceptably poor outcomes, a treatment that works consistently well for every patient is more reliable and desirable. For such applications, the tradeoff between the expected rewards and the variances of arms should be considered in a risk-aware MAB framework. Building on the GRA-UCB algorithm and utilizing the asymptotic distribution of the sample variance, a novel asymptotic risk-aware upper confidence bound (ARA-UCB) algorithm is designed for general sub-Gaussian reward distributions and proved to achieve $\mathcal{O}(\log(T))$ learning regret.
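For concreteness, a common form of the mean-variance (MV) criterion in risk-aware bandits, for arm $i$ with mean $\mu_i$, variance $\sigma_i^2$, and risk tolerance $\rho > 0$, is (the paper's exact definition may differ):

```latex
\mathrm{MV}_i \;=\; \sigma_i^2 \;-\; \rho\,\mu_i ,
```

so that arms with smaller $\mathrm{MV}_i$ are preferred. The asymptotics behind an ARA-UCB-style index rest on the standard central limit theorem for the sample variance $\hat{\sigma}_n^2$ of $n$ i.i.d. rewards with fourth central moment $\mu_4$:

```latex
\sqrt{n}\left(\hat{\sigma}_n^{2} - \sigma^{2}\right)
\xrightarrow{d} \mathcal{N}\!\left(0,\ \mu_4 - \sigma^{4}\right),
```

which reduces to asymptotic variance $2\sigma^{4}$ in the Gaussian case and yields an upper confidence bound on $\sigma^2$ that shrinks at rate $\mathcal{O}(1/\sqrt{n})$.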

PROBLEM FORMULATION
PROPOSED ALGORITHM
THEORETICAL ANALYSIS
Learning regret of GRA-UCB
Learning regret of ARA-UCB
NUMERICAL RESULTS
CONCLUSION
