We consider a principal, or controller, who can pick actions from a fixed action set to control an evolving system with converging dynamics. By converging dynamics we mean that, if the principal holds the same action, the system asymptotically converges to a unique stable state determined by that action. In our model, the dynamics of the system are unknown to the principal, who receives only (possibly noisy) bandit feedback on the impact of their actions. The principal aims to learn which stable state yields the highest reward while adhering to specific constraints, and to steer the system into this state as quickly as possible. We measure the principal's performance in terms of regret and constraint violation. When the action set is finite, we propose an algorithm, Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), that ensures sublinear regret and constraint violation simultaneously. In particular, OP-C2B achieves logarithmic regret and constraint violation when the system's convergence rate is linear or superlinear. Furthermore, we generalize OP-C2B to infinite action sets and show that it maintains sublinear regret and constraint violation.
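The abstract does not spell out the selection rule, but the optimistic-pessimistic confidence-bound idea it names has a standard shape: be optimistic (upper confidence bound) about rewards and pessimistic (upper confidence bound) about constraint costs, then pick the best action from the pessimistically feasible set. The sketch below illustrates only that generic principle for a finite action set; all names, the Hoeffding-style radius, and the fallback rule are assumptions, and the paper's actual OP-C2B additionally tracks convergence toward per-action stable states, which is omitted here.

```python
import numpy as np

def select_action(counts, reward_sums, cost_sums, t, cost_budget):
    """Hypothetical optimistic-pessimistic rule (not the paper's OP-C2B):
    choose the highest reward UCB among actions whose pessimistic (upper)
    cost estimate stays within the constraint budget."""
    # Play each action once before trusting any confidence bound.
    unplayed = np.where(counts == 0)[0]
    if unplayed.size:
        return int(unplayed[0])

    n = counts.astype(float)
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / n)   # Hoeffding-style radius (assumed)
    reward_ucb = reward_sums / n + bonus            # optimism for reward
    cost_ucb = cost_sums / n + bonus                # pessimism for constraint cost

    feasible = np.where(cost_ucb <= cost_budget)[0]
    # If the pessimistic feasible set is empty, fall back to all actions
    # so the learner keeps gathering information.
    pool = feasible if feasible.size else np.arange(len(counts))
    return int(pool[np.argmax(reward_ucb[pool])])
```

After each round, the caller would update `counts`, `reward_sums`, and `cost_sums` for the played action with the observed bandit feedback; the pessimistic feasibility test is what keeps cumulative constraint violation controlled while the reward UCB drives regret down.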