Constant Regret Research Articles

We study online learning and equilibrium computation in games with polyhedral decision sets, a property shared by normal-form games (NFGs) and extensive-form games (EFGs), when the learning agent is restricted to utilizing a best-response oracle. We show how to achieve constant regret in zero-sum games and O(T^0.25) regret in general-sum games while using only O(log t) best-response queries at a given iteration t, thus improving over the best prior result, which required O(T) queries per iteration. Moreover, our framework yields the first last-iterate convergence guarantees for self-play with best-response oracles in zero-sum games. This convergence occurs at a linear rate, though with a condition-number dependence. We go on to show a O(T^(-0.5)) best-iterate convergence rate without such a dependence. Our results build on linear-rate convergence results for variants of the Frank-Wolfe (FW) algorithm for strongly convex and smooth minimization problems over polyhedral domains. These FW results depend on a condition number of the polytope, known as facial distance. In order to enable application to settings such as EFGs, we show two broad new results: 1) the facial distance for polytopes in standard form is at least γ/k where γ is the minimum value of a nonzero coordinate of a vertex of the polytope and k≤n is the number of tight inequality constraints in the optimal face, and 2) the facial distance for polytopes of the form Ax=b, Cx≤d, x≥0 where x∈R^n, C≥0 is a nonzero integral matrix, and d≥0, is at least 1/(c√n), where c is the infinity norm of C. This yields the first such results for several problems such as sequence-form polytopes, flow polytopes, and matching polytopes.

We consider the problem of power allocation over one or more time-varying channels with unknown distributions in energy harvesting communications. In the single-channel case, the transmitter chooses the transmit power based on the amount of stored energy in its battery with the goal of maximizing the average rate over time. We model this problem as a Markov decision process (MDP) with transmitter as the agent, battery status as the state, transmits power as the action and rate as the reward. The average reward maximization problem can be modeled by a linear program (LP) that uses the transition probabilities for the state-action pairs and their reward values to select a power allocation policy. This problem is challenging because the uncertainty in channels implies that the mean rewards associated with the state-action pairs are unknown. We therefore propose two online learning algorithms: linear program of sample means (LPSM) and Epoch-LPSM that learn these rewards and adapt their policies over time. For both algorithms, we prove that their regret is upper-bounded by a constant. To our knowledge this is the first result showing constant regret learning algorithms for MDPs with unknown mean rewards. We also prove an even stronger result about LPSM: that its policy matches the optimal policy exactly in finite expected time. Epoch-LPSM incurs a higher regret compared with the LPSM, while reducing the computational requirements substantially. We further consider a multi-channel scenario, where the agent also chooses a channel in each slot, and present our multi-channel LPSM (MC-LPSM) algorithm that explores different channels and uses that information to solve the LP during exploitation. MC-LPSM incurs a regret that scales logarithmically in time and linearly in the number of channels. Through a matching lower bound on the regret of any algorithm, we also prove the asymptotic order optimality of MC-LPSM.

Constant Regret Research Articles

Articles published on Constant Regret

Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes

Online Markov Decision Processes Configuration with Continuous Decision Space

Efficient Learning in Polyhedral Games via Best-Response Oracles

Distributed online optimization for heterogeneous linear multi-agent systems with coupled constraints

On the regret of online edge service hosting

Dynamic Matching: Characterizing and Achieving Constant Regret

On the Regret of Online Edge Service Hosting

Bayesian dithering for learning: Asymptotically optimal policies in dynamic pricing

Constant Regret Resolving Heuristics for Price-Based Revenue Management

One for All and All for One: Distributed Learning of Fair Allocations With Multi-Player Bandits

Policy Optimization as Online Learning with Mediator Feedback

Multiplayer Bandits: A Trekking Approach

Online Allocation and Pricing: Constant Regret via Bellman Inequalities

Dynamic Matching: Characterizing and Achieving Constant Regret

Second-Order Online Nonconvex Optimization

Dithering for Learning: Computationally Efficient Policies for Dynamic Pricing in High Dimensions

Distributed Learning and Coordination in Cognitive Infrastructureless Networks of Unknown Size

Distributed Learning in Ad-Hoc Networks with Unknown Number of Players

Satisfactory content delivery scheme for QoS provisioning in delay tolerant networks

Online Learning Schemes for Power Allocation in Energy Harvesting Communications

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Constant Regret Research Articles

Articles published on Constant Regret

Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes

Online Markov Decision Processes Configuration with Continuous Decision Space

Efficient Learning in Polyhedral Games via Best-Response Oracles

Distributed online optimization for heterogeneous linear multi-agent systems with coupled constraints

On the regret of online edge service hosting

Dynamic Matching: Characterizing and Achieving Constant Regret

On the Regret of Online Edge Service Hosting

Bayesian dithering for learning: Asymptotically optimal policies in dynamic pricing

Constant Regret Resolving Heuristics for Price-Based Revenue Management

One for All and All for One: Distributed Learning of Fair Allocations With Multi-Player Bandits

Policy Optimization as Online Learning with Mediator Feedback

Multiplayer Bandits: A Trekking Approach

Online Allocation and Pricing: Constant Regret via Bellman Inequalities

Dynamic Matching: Characterizing and Achieving Constant Regret

Second-Order Online Nonconvex Optimization

Dithering for Learning: Computationally Efficient Policies for Dynamic Pricing in High Dimensions

Distributed Learning and Coordination in Cognitive Infrastructureless Networks of Unknown Size

Distributed Learning in Ad-Hoc Networks with Unknown Number of Players

Satisfactory content delivery scheme for QoS provisioning in delay tolerant networks

Online Learning Schemes for Power Allocation in Energy Harvesting Communications