Cumulative Regret Research Articles

We consider a price-based revenue management problem with finite reusable resources over a finite time horizon. Stochastically arriving customers each request an exponentially distributed service time and may balk or renege when resources are insufficient. A resource unit, upon finishing service for one customer, is released and can serve the next customer immediately. The arrival, service, balking, and reneging rates all depend on the price being offered. In this paper, we assume that the firm does not know the mappings between these rates and prices, and thus it makes adaptive pricing decisions in each period based only on past sales to maximize the cumulative revenue. We propose two new multi-armed bandit (MAB) based learning algorithms, termed the Batch Upper Confidence Bound (BUCB) and Batch Thompson Sampling (BTS) algorithms, for finding near-optimal pricing policies. Compared with prior pricing and MAB literature, the salient difficulties of this problem lie in (i) the unknown rate-and-price mapping information, (ii) the dynamic nature of reusable resources being committed over time, (iii) the transient behavior of the service system when the price changes, and (iv) unbounded and heavy-tailed distributions of observed random variables. Our proposed algorithms contain a Warm-up Phase to eliminate the heavy-tail effects and a Learning Phase to identify the optimal price. Our algorithms separate the Learning Phase into successive operational batches and select a price from a prescribed set in each batch using past sales collected in previous batches. The performance measure is cumulative regret, which is the difference between the revenue attained by a clairvoyant optimal pricing policy under full distributional information and the revenue attained by our approach. We prove that the cumulative regret is $O(\sqrt{PT\log (T)})$, where $T$ is the total number of time periods and $P$ is the cardinality of the feasible price set, and the result matches the lower bound up to a logarithmic factor. As an intermediate step, we also develop a coupling analysis of the time a queue takes to reach the steady state, starting either from an empty state or from a steady state under another set of system parameters. Our numerical experiments confirm the efficacy of the proposed BUCB and BTS algorithms.
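The batch structure described in this abstract can be illustrated with a small sketch. This is not the paper's BUCB algorithm: it omits the Warm-up Phase, the queueing dynamics, and the heavy-tailed observations, and it assumes a hypothetical Bernoulli purchase model (`demand_prob`) purely to simulate sales. The function name, the confidence-bonus form, and the normalization by the maximum price are all assumptions made for illustration.

```python
import math
import random


def batched_ucb(prices, demand_prob, num_batches, batch_size, seed=0):
    """Illustrative batched UCB over a finite price set.

    prices: list of candidate prices (the "prescribed set").
    demand_prob: hypothetical function price -> purchase probability;
        it is hidden from the learner and only used to simulate sales.
    Returns (total_revenue, pulls), where pulls[i] counts the batches
    in which prices[i] was offered.
    """
    rng = random.Random(seed)
    max_price = max(prices)
    horizon = num_batches * batch_size
    counts = [0] * len(prices)   # periods observed at each price
    means = [0.0] * len(prices)  # mean per-period revenue, scaled to [0, 1]
    pulls = [0] * len(prices)
    total_revenue = 0.0

    for batch in range(num_batches):
        if batch < len(prices):
            arm = batch  # offer every price for one batch first
        else:
            # Pick the price with the largest upper confidence bound,
            # computed only from sales in previous batches.
            arm = max(
                range(len(prices)),
                key=lambda i: means[i]
                + math.sqrt(2.0 * math.log(horizon) / counts[i]),
            )
        price = prices[arm]
        # Hold the price fixed for the whole batch and collect revenue.
        batch_revenue = sum(
            price for _ in range(batch_size) if rng.random() < demand_prob(price)
        )
        total_revenue += batch_revenue
        new_count = counts[arm] + batch_size
        means[arm] = (means[arm] * counts[arm] + batch_revenue / max_price) / new_count
        counts[arm] = new_count
        pulls[arm] += 1

    return total_revenue, pulls


if __name__ == "__main__":
    probs = {1.0: 0.9, 2.0: 0.8, 3.0: 0.1}  # assumed demand curve
    revenue, pulls = batched_ucb([1.0, 2.0, 3.0], probs.get, 200, 50)
    print(revenue, pulls)
```

Holding the price fixed within a batch is what lets a real service system settle after a price change; in this toy model sales are independent across periods, so batching only reduces how often the price changes.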

Motivated by the emerging need for learning algorithms for large-scale networked and decentralized systems, we introduce a distributed version of the classical stochastic Multi-Armed Bandit (MAB) problem. Our setting consists of a large number of agents $n$ that collaboratively and simultaneously solve the same instance of a $K$-armed MAB to minimize the average cumulative regret over all agents. The agents can communicate and collaborate with each other only through a pairwise asynchronous gossip-based protocol that exchanges a limited number of bits. In our model, agents at each point decide (i) which arm to play, (ii) whether to communicate, and if so, (iii) what to communicate and with whom. Agents in our model are decentralized; that is, their actions depend only on their own observed history. We develop a novel algorithm in which agents, whenever they choose, communicate only arm-ids and not samples, with another agent chosen uniformly and independently at random. The per-agent regret scaling achieved by our algorithm is $O\left( \frac{\lceil \frac{K}{n} \rceil + \log(n)}{\Delta} \log(T) + \frac{\log^3(n) \log\log(n)}{\Delta^2} \right)$. Furthermore, any agent in our algorithm communicates (arm-ids to a uniformly and independently chosen agent) only a total of $\Theta(\log(T))$ times over a time interval of $T$. We compare our results to two benchmarks: one where there is no communication among agents, and one corresponding to complete interaction, where an agent has access to the entire system history of arms played and rewards obtained by all agents. We show, both theoretically and empirically, that our algorithm achieves a significant reduction both in per-agent regret, when compared to the case where agents do not collaborate and each agent plays the standard MAB problem (where regret would scale linearly in $K$), and in communication complexity, when compared to the full interaction setting, which requires $T$ communication attempts by an agent over $T$ arm pulls. Our result thus demonstrates that even a minimal level of collaboration among the different agents enables a significant reduction in per-agent regret.
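To get a feel for the per-agent regret scaling quoted in this abstract, the sketch below evaluates the expression (ceil(K/n) + log n)/Δ · log T + log³(n) · log log(n)/Δ² next to the single-agent scaling K/Δ · log T. Both are order-of-magnitude expressions with hidden constants dropped, natural logarithms are assumed, and the function names and parameter values are hypothetical, so the numbers indicate relative growth only.

```python
import math


def collaborative_scaling(K, n, delta, T):
    """Per-agent regret scaling with gossip, constants omitted (n >= 3)."""
    first = (math.ceil(K / n) + math.log(n)) / delta * math.log(T)
    second = math.log(n) ** 3 * math.log(math.log(n)) / delta ** 2
    return first + second


def single_agent_scaling(K, delta, T):
    """Scaling when each agent solves the K-armed problem alone."""
    return K / delta * math.log(T)


if __name__ == "__main__":
    # Assumed example values: 1000 arms, 100 agents, gap 0.1, horizon 10^6.
    K, n, delta, T = 1000, 100, 0.1, 10**6
    print(collaborative_scaling(K, n, delta, T))
    print(single_agent_scaling(K, delta, T))
```

With these assumed values the collaborative expression is several times smaller, because the linear dependence on K is replaced by a dependence on ceil(K/n).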

Articles published on Cumulative Regret

Contextual Inverse Optimization: Offline and Online Learning

Safe Linear Thompson Sampling With Side Information

Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit

Nonstationary Bandits with Habituation and Recovery Dynamics

Technical note: Finite‐time regret analysis of Kiefer‐Wolfowitz stochastic approximation algorithm and nonparametric multi‐product dynamic pricing with unknown demand

Trading Convergence Rate with Computational Budget in High Dimensional Bayesian Optimization

Online Learning and Pricing for Service Systems with Reusable Resources

Online Pricing with Reserve Price Constraint for Personal Data Markets

Cost-Aware Cascading Bandits

Conducting Non-adaptive Experiments in a Live Setting: A Bayesian Approach to Determining Optimal Sample Size

Social Learning in Multi Agent Multi Armed Bandits

A Heuristic for Learn-and-Optimize New Mobility Services with Equity and Efficiency Metrics

A Policy for Optimizing Sub-Band Selection Sequences in Wideband Spectrum Sensing

Everybody Needs Somebody Sometimes: Validation of Adaptive Recovery in Robotic Space Operations

Learning Optimal Online Advertising Portfolios with Periodic Budgets

Cost-Aware Learning and Optimization for Opportunistic Spectrum Access

Dynamic Pricing with Unknown Non-Parametric Demand and Limited Price Changes

Influence Maximization Based Global Structural Properties: A Multi-Armed Bandit Approach

Selfish Bandit-Based Cognitive Anti-Jamming Strategy for Aeronautic Swarm Network in Presence of Multiple Jammer

Online Assortment Optimization with High-Dimensional Data
