Abstract

We consider a biobjective sequential decision-making problem where an allocation (arm) is called ε lexicographic optimal if its expected reward in the first objective is at most ε smaller than the highest expected reward, and its expected reward in the second objective is at least the expected reward of a lexicographic optimal arm. The goal of the learner is to select arms that are ε lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. For this problem, we first show that the learner’s goal is equivalent to minimizing the ε lexicographic regret, and then propose a learning algorithm whose ε lexicographic gap-dependent regret is bounded and gap-independent regret is sublinear in the number of rounds with high probability. Then, we apply the proposed model and algorithm to dynamic rate and channel selection in a cognitive radio network with imperfect channel sensing. Our results show that the proposed algorithm is able to learn the approximate lexicographic optimal rate–channel pair that simultaneously minimizes the primary user interference and maximizes the secondary user throughput.
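In symbols, with μ^i_a denoting the expected reward of arm a in objective i ∈ {1, 2} (our notation, which may differ from the paper’s), the definition above can be written as follows.

```latex
% a^* is a lexicographic optimal arm: it maximizes the objective-2 reward
% among the arms that maximize the objective-1 reward.
\[
  a^* \in \operatorname*{arg\,max}_{a \,:\, \mu^1_a = \max_b \mu^1_b} \mu^2_a ,
  \qquad
  \mathcal{A}^*_{\epsilon} =
  \Bigl\{ a \;:\; \mu^1_a \ge \max_b \mu^1_b - \epsilon
          \ \text{ and }\ \mu^2_a \ge \mu^2_{a^*} \Bigr\} .
\]
% The learner's goal is to play arms from the epsilon lexicographic optimal
% set \mathcal{A}^*_{\epsilon} as often as possible.
```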

Highlights

  • The multiarmed bandit (MAB) is used to model real-world applications in which the decision maker repeatedly interacts with its unknown environment in order to maximize its long-term reward [1, 2]

  • In order to quantify the loss of the learner due to not knowing the ε lexicographic optimal arms beforehand, we introduce the notion of ε lexicographic regret, and propose a learning algorithm whose ε lexicographic gap-dependent regret is O(1) and gap-independent regret is O(√T) with high probability

  • For small values of ε, the regret in objective 2 decreases because ALEX frequently selects the arm with the highest upper confidence bound (UCB) in objective 1, prioritizing learning the best arm in objective 1 well over searching for an approximate lexicographic optimal arm (see the sketch below)
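
A minimal Python sketch of a lexicographic UCB-style selection rule in this spirit is given below; the candidate-set construction and the confidence bonus are our illustrative choices, not the paper’s exact ALEX pseudocode.

```python
import math

def lexicographic_ucb_choice(means1, means2, pulls, t, epsilon):
    # means1/means2: empirical mean rewards per arm in objectives 1 and 2;
    # pulls: number of times each arm has been played; t: current round.
    num_arms = len(pulls)

    def bonus(a):
        return math.sqrt(2.0 * math.log(t) / pulls[a]) if pulls[a] > 0 else float("inf")

    ucb_obj1 = [means1[a] + bonus(a) for a in range(num_arms)]
    ucb_obj2 = [means2[a] + bonus(a) for a in range(num_arms)]
    best1 = max(ucb_obj1)
    # Candidate arms that look near-optimal in objective 1; for small epsilon this set
    # often contains only the objective-1 leader, matching the behavior described above.
    candidates = [a for a in range(num_arms) if ucb_obj1[a] >= best1 - epsilon]
    # Break the tie lexicographically: highest objective-2 UCB among the candidates.
    return max(candidates, key=lambda a: ucb_obj2[a])
```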


Summary

Introduction

The multiarmed bandit (MAB) is used to model real-world applications in which the decision maker repeatedly interacts with its unknown environment in order to maximize its long-term reward [1, 2]. A plethora of prior works on the MAB focused on designing learning algorithms that optimize the total scalar reward. These include the celebrated upper confidence bound (UCB) policies [1, 6] and posterior sampling [2, 7]. In the biobjective MAB, the learner receives, at each round, random rewards from two objectives. These objectives are lexicographically ordered in the sense that the learner values the first objective more than the second objective.
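
For concreteness, here is a minimal Python sketch of the classical single-objective UCB1 index and play loop referred to above [1]; the function names and the `pull_arm` environment callback are our illustrative choices, not part of the paper.

```python
import math
import random

def ucb1_index(mean_reward, pulls, t):
    # Empirical mean plus an exploration bonus that shrinks as the arm is pulled more.
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)

def ucb1(pull_arm, num_arms, horizon):
    # pull_arm(a) returns a random reward in [0, 1] for arm a (hypothetical environment).
    pulls = [0] * num_arms
    means = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            a = t - 1  # play each arm once to initialize
        else:
            a = max(range(num_arms), key=lambda i: ucb1_index(means[i], pulls[i], t))
        r = pull_arm(a)
        pulls[a] += 1
        means[a] += (r - means[a]) / pulls[a]  # incremental mean update
    return means, pulls

# Example: three Bernoulli arms with means 0.3, 0.5, 0.7; UCB1 typically
# concentrates most of the 10,000 pulls on the last (best) arm.
means, pulls = ucb1(lambda a: float(random.random() < [0.3, 0.5, 0.7][a]),
                    num_arms=3, horizon=10_000)
```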

