Abstract

The contextual bandit is a popular sequential decision-making framework for balancing the exploration-exploitation tradeoff in applications such as recommender systems and search engines. Motivated by two important factors in real-world applications, namely that 1) latent contexts (or features) often exist and 2) feedback often involves humans in the loop and is therefore subject to human biases, we formulate a generalized contextual bandit framework with latent contexts. The proposed framework includes an interpretable two-layer probabilistic model of human feedback with latent features. We design the GCL-PS algorithm for this framework, which uses posterior sampling to balance exploration and exploitation. We prove a sublinear regret upper bound for GCL-PS, and prove a lower bound for the proposed bandit framework that sheds light on the optimality of GCL-PS. To further improve the computational efficiency of GCL-PS, we propose a Markov Chain Monte Carlo (MCMC) algorithm to generate approximate posterior samples, resulting in our GCL-PSMC algorithm. We not only prove a sublinear Bayesian regret upper bound for GCL-PSMC, but also reveal the tradeoff between computational efficiency and sequential decision accuracy. Finally, we apply the proposed framework to hotel recommendations and news article recommendations, and demonstrate its superior performance over a variety of baselines through experiments on two public datasets.
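
For readers unfamiliar with posterior sampling, the sketch below illustrates the general idea in the simplest standard setting: a linear contextual bandit with a conjugate Gaussian posterior, so exact posterior samples are available in closed form. This is a hedged illustration, not the paper's GCL-PS; all dimensions, priors, and noise parameters here are illustrative assumptions. GCL-PS instead samples from the posterior of the paper's two-layer latent-feature model of human feedback, and GCL-PSMC replaces exact sampling with MCMC approximations.

```python
import numpy as np

# Minimal posterior-sampling (Thompson sampling) sketch for a linear
# contextual bandit. NOT the paper's GCL-PS: it assumes fully observed
# contexts and Gaussian linear rewards with a conjugate Gaussian prior,
# so exact posterior samples exist in closed form. GCL-PS samples from
# the posterior of a two-layer latent-feature model, and GCL-PSMC uses
# MCMC when exact sampling is intractable.

rng = np.random.default_rng(0)
d, n_arms, horizon = 5, 10, 2000
theta_true = rng.normal(size=d)        # unknown reward parameter
noise_std, prior_var = 0.1, 1.0

# Gaussian posterior over theta, tracked via its natural parameters:
# precision = Sigma^{-1}, b = Sigma^{-1} mu.
precision = np.eye(d) / prior_var
b = np.zeros(d)

for t in range(horizon):
    contexts = rng.normal(size=(n_arms, d))  # one feature vector per arm

    # 1) Sample a plausible parameter from the current posterior.
    cov = np.linalg.inv(precision)
    theta_sample = rng.multivariate_normal(cov @ b, cov)

    # 2) Act greedily with respect to the sampled parameter; randomness
    #    in the sample is what drives exploration.
    arm = int(np.argmax(contexts @ theta_sample))
    x = contexts[arm]

    # 3) Observe a noisy reward and update the posterior (conjugate step).
    reward = x @ theta_true + noise_std * rng.normal()
    precision += np.outer(x, x) / noise_std**2
    b += reward * x / noise_std**2
```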
