Abstract

The classical multi-armed bandit (MAB) framework studies the exploration-exploitation dilemma in sequential decision making and treats the arm with the highest expected reward as the optimal choice. However, in some applications, an arm with a high expected reward can be risky to play if its variance is high. Hence, the variation of the reward should be considered to make the arm-selection process risk-aware. In this letter, the mean-variance metric is investigated to measure the uncertainty of the received rewards. We first study a risk-aware MAB problem in which the reward follows a Gaussian distribution, and develop a concentration inequality on the variance to design a Gaussian risk-aware upper confidence bound (GRA-UCB) algorithm. We then extend this algorithm to a novel asymptotic risk-aware upper confidence bound (ARA-UCB) algorithm by deriving an upper confidence bound on the variance from the asymptotic distribution of the sample variance. Theoretical analysis proves that both proposed algorithms achieve $\mathcal{O}(\log(T))$ regret. Finally, numerical results demonstrate that our algorithms outperform several risk-aware MAB algorithms.
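To make the arm-selection rule concrete, below is a minimal sketch of a mean-variance UCB-style index in Python. It assumes the common MV form $\mathrm{MV}_i = \sigma_i^2 - \rho\,\mu_i$ (smaller is better) and a generic confidence width; the function names, the default $\rho$, and the widths are illustrative assumptions, not the paper's exact GRA-UCB or ARA-UCB bounds.

```python
import numpy as np

def mv_index(rewards, t, rho=1.0):
    """Optimistic (lower-confidence) estimate of an arm's mean-variance
    MV = sigma^2 - rho * mu, which the learner wants to minimize.

    rewards : list of observed rewards for this arm
    t       : current round (enters the confidence width)
    rho     : risk-tolerance weight on the mean (hypothetical default)

    The O(sqrt(log t / n)) widths below are generic illustrations, not
    the paper's exact concentration bounds on the mean and variance.
    """
    n = len(rewards)
    if n < 2:
        return -np.inf          # force an initial pull of every arm
    mu_hat = np.mean(rewards)
    var_hat = np.var(rewards, ddof=1)
    width = np.sqrt(2.0 * np.log(t + 1) / n)
    # Optimism for a minimization objective: deflate the variance
    # estimate and inflate the mean estimate by the confidence width.
    return (var_hat - width) - rho * (mu_hat + width)

def select_arm(history, t, rho=1.0):
    """Play the arm whose optimistic MV index is smallest."""
    return int(np.argmin([mv_index(r, t, rho) for r in history]))

# Example round: three arms, with the rewards observed so far.
history = [[1.0, 0.9, 1.1], [2.0, 0.1, 3.9], [0.5]]
print(select_arm(history, t=7))
```

The design point mirrored from the abstract is that optimism must cover both the mean and the variance estimates; the variance side is precisely where the concentration inequality (GRA-UCB) or the asymptotic sample-variance bound (ARA-UCB) enters.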

Highlights

  • The multi-armed bandit (MAB) models the online sequential decision-making problem for various applications including financial portfolio design, online recommendations, and crowdsourcing systems

  • In a standard MAB, a random reward of an arm can be observed once played by an agent, and the objective is to maximize the cumulative reward over a given number of plays (equivalently, to minimize the regret sketched after this list)

  • We study the risk-aware MAB problem based on the mean-variance (MV) paradigm and propose two novel risk-aware MAB algorithms
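For reference, the risk-neutral objective in the second highlight corresponds to minimizing the expected regret; the definition below is the textbook one, not quoted from the paper:

```latex
\mathcal{R}(T) \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{i_t}\right],
\qquad \mu^{*} = \max_i \mu_i ,
```

where $i_t$ is the arm played in round $t$. The risk-aware setting studied here replaces the mean $\mu_i$ with the MV metric defined in the Introduction below, and the abstract's $\mathcal{O}(\log(T))$ guarantee refers to that risk-aware regret.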


Summary

INTRODUCTION

The multi-armed bandit (MAB) models the online sequential decision-making problem for various applications including financial portfolio design, online recommendations, and crowdsourcing systems. In clinical trials, for instance, instead of choosing a treatment that achieves the best average therapeutic result but may occasionally lead to unacceptably poor outcomes, a treatment that works consistently well for every patient is more reliable and desirable. For such applications, the tradeoff between the expected rewards and the variances of arms should be considered in a risk-aware MAB framework. Building on the GRA-UCB algorithm and utilizing the asymptotic distribution of the sample variance, a novel asymptotic risk-aware upper confidence bound (ARA-UCB) algorithm is designed for general sub-Gaussian reward distributions and proved to achieve $\mathcal{O}(\log(T))$ learning regret.
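For concreteness, a common form of the mean-variance (MV) criterion in risk-aware bandits, for arm $i$ with mean $\mu_i$, variance $\sigma_i^2$, and risk tolerance $\rho > 0$, is (the paper's exact definition may differ):

```latex
\mathrm{MV}_i \;=\; \sigma_i^2 \;-\; \rho\,\mu_i ,
```

so that arms with smaller $\mathrm{MV}_i$ are preferred. The asymptotics behind an ARA-UCB-style index rest on the standard central limit theorem for the sample variance $\hat{\sigma}_n^2$ of $n$ i.i.d. rewards with fourth central moment $\mu_4$:

```latex
\sqrt{n}\left(\hat{\sigma}_n^{2} - \sigma^{2}\right)
\xrightarrow{d} \mathcal{N}\!\left(0,\ \mu_4 - \sigma^{4}\right),
```

which reduces to asymptotic variance $2\sigma^{4}$ in the Gaussian case and yields an upper confidence bound on $\sigma^2$ that shrinks at rate $\mathcal{O}(1/\sqrt{n})$.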

PROBLEM FORMULATION
PROPOSED ALGORITHM
THEORETICAL ANALYSIS
Learning regret of GRA-UCB
Learning regret of ARA-UCB
NUMERICAL RESULTS
CONCLUSION
