Abstract

Contextual combinatorial cascading bandit ($C^{3}$-bandit) is a powerful multi-armed bandit framework that balances the tradeoff between exploration and exploitation in the learning process. It captures users' click behavior well and has been applied in a broad spectrum of real-world applications such as recommender systems and search engines. However, this framework provides no performance guarantee during the initial exploration phase. To that end, we propose the conservative contextual combinatorial cascading bandit ($C^{4}$-bandit) model, which addresses this modeling issue. In this problem, the learning agent is given some contexts, recommends a list of items that performs no worse than a baseline strategy, and then observes the reward according to a stopping rule. The objective is to maximize the reward while simultaneously satisfying the safety constraint, i.e. guaranteeing that the algorithm performs at least as well as the baseline strategy. To tackle this new problem, we extend an online learning algorithm, the Upper Confidence Bound (UCB), to handle the critical tradeoff between exploitation and exploration, and employ a conservative mechanism to handle the safety constraint. By carefully integrating these two techniques, we develop a new algorithm, called $C^{4}$-UCB, for this problem. Further, we rigorously prove n-step regret upper bounds in two situations: known baseline reward and unknown baseline reward. In both situations the regret is enlarged only by an additive constant term compared with the results for the $C^{3}$-bandit. Finally, experiments on synthetic and real-world datasets demonstrate its advantages.
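The following is a minimal, self-contained Python sketch (not the authors' implementation) of the interaction protocol described in the abstract: the learner scores items with linear-UCB estimates built from the given contexts, recommends a ranked list, and receives cascading feedback, i.e. observations only up to the position where the user stops. The dimensions, the exploration weight `alpha_ucb`, and the simulated click model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, K, alpha_ucb = 5, 20, 4, 0.5   # feature dim, catalogue size, list length, exploration weight
theta_star = rng.uniform(0, 1, size=d) / np.sqrt(d)   # unknown attraction parameter (simulation only)

A = np.eye(d)          # ridge-regression design matrix
b = np.zeros(d)        # ridge-regression response vector

for t in range(1000):
    X = rng.uniform(0, 1, size=(n_items, d)) / np.sqrt(d)     # item contexts for this round
    theta_hat = np.linalg.solve(A, b)                          # current least-squares estimate
    A_inv = np.linalg.inv(A)
    width = np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))    # per-item confidence width
    ucb = np.clip(X @ theta_hat + alpha_ucb * width, 0.0, 1.0)
    ranked = np.argsort(-ucb)[:K]                              # recommend the top-K list by UCB score

    # Cascading feedback: the user scans the list in order and stops at the first click,
    # so only the examined prefix of the list generates observations.
    for item in ranked:
        p_click = float(X[item] @ theta_star)                  # true attraction probability
        clicked = rng.random() < p_click
        A += np.outer(X[item], X[item])
        b += float(clicked) * X[item]
        if clicked:                                            # stopping rule: feedback ends here
            break
```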

Highlights

  • Many problems in the real world can be formulated as decision-making problems under uncertainty

  • In this paper, motivated by maximizing the reward while simultaneously satisfying a safety constraint in the $C^{3}$-bandit, we propose a novel setting, the conservative contextual combinatorial cascading bandit ($C^{4}$-bandit)

  • We develop an upper confidence bound (UCB)-based algorithm and give an $O(\sqrt{T \log T})$ regret bound in both the known and unknown baseline reward situations


Summary

INTRODUCTION

Many problems in the real world can be formulated as decision-making problems under uncertainty. In a real scenario, before running bandit algorithms, the agent generally possesses a small portion of user data, which can be used to extract a baseline strategy. Although this baseline strategy does not have good overall cumulative performance, it provides a worst-case performance guarantee at each time step. The conservative bandit [24]–[27] assumes the agent already has such a baseline strategy and wishes to design an algorithm that performs better than it at each time step; these models, however, only cover the classic MAB with a safety guarantee. We reduce the regret incurred by the conservative mechanism to a constant for the first time, using the Lipschitz property and the range of the reward; this result extends to general combinatorial semi-bandits [4] with a safety guarantee. Experiments on both synthetic and real-world data are conducted, and the results validate our theoretical findings.
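To make the conservative mechanism concrete, below is a hedged Python illustration of the per-round safety check (an illustration of the idea, not the paper's $C^{4}$-UCB pseudocode): the learner plays the UCB-recommended list only when a pessimistic lower-confidence estimate of its resulting cumulative reward still dominates a $(1-\alpha)$ fraction of the baseline's cumulative reward; otherwise it falls back to the baseline. The function name and the budget parameter `alpha` are illustrative assumptions.

```python
def conservative_choice(ucb_list, baseline_list, lcb_cum_if_ucb, baseline_cum, alpha=0.1):
    """Pick the list to play this round under the safety constraint.

    ucb_list        -- ranked item list proposed by the UCB rule
    baseline_list   -- ranked item list of the baseline strategy
    lcb_cum_if_ucb  -- lower confidence bound on cumulative reward if ucb_list is played now
    baseline_cum    -- cumulative reward the baseline strategy would have collected
    alpha           -- fraction of baseline performance the learner may sacrifice
    """
    if lcb_cum_if_ucb >= (1.0 - alpha) * baseline_cum:
        return ucb_list        # constraint certifiably satisfied: explore with the UCB list
    return baseline_list       # otherwise stay conservative and play the baseline


# Example: the lower bound 45.0 exceeds 0.9 * 48.0, so exploration is allowed.
print(conservative_choice([3, 7, 1], [2, 5, 9], lcb_cum_if_ucb=45.0, baseline_cum=48.0))
```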

RELATED WORK
PROBLEM FORMULATION
ALGORITHMS
REGRET ANALYSIS
REGRET IN KNOWN BASELINE REWARD SITUATION
REGRET IN THE UNKNOWN BASELINE REWARD SITUATION
A BETTER CONSERVATIVE CONSTRAINT FOR CONSERVATIVE CONTEXTUAL LINEAR BANDIT
SYNTHETIC DATA