Online Multi-player Resource-Sharing Games with Bandit Feedback

Abstract: This paper considers an online multi-player resource-sharing game with bandit feedback. Multiple players choose from a finite collection of resources in a time-slotted system. In each time slot, each resource brings a random reward that is equally divided among the players who choose it. The reward vector is independent and identically distributed over the time slots. The statistics of the reward vector are unknown to the players. During each time slot, for each resource chosen by the first player, they receive as feedback the reward of the resource and the number of players who chose it, after the choice is made. We develop a novel Upper Confidence Bound (UCB) algorithm that learns the mean rewards using the feedback and maximizes the worst-case time-average expected reward of the first player. The algorithm gets within $\mathcal{O}(\log(T)/\sqrt{T})$ of optimality within $T$ time slots. The simulations depict fast convergence of the learnt policy in comparison to the worst-case optimal policy.
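
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the setting, not the paper's method. It pairs the equal-split reward model with the classical UCB1 index. The Bernoulli rewards, the single uniformly random opponent, the one-resource-per-slot choice, and the names `ucb1_index`, `shared_reward`, and `run` are all assumptions made for illustration.

```python
import math
import random


def ucb1_index(mean_estimate, pulls, t):
    """Classical UCB1 index: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return float("inf")  # force at least one sample of every resource
    return mean_estimate + math.sqrt(2.0 * math.log(t) / pulls)


def shared_reward(reward, num_players):
    """A resource's reward is split equally among the players choosing it."""
    return reward / num_players


def run(true_means, horizon, seed=0):
    """Simulate player 1 learning the mean rewards from bandit feedback.

    Assumptions (hypothetical): Bernoulli rewards, one resource chosen per
    slot, and one opponent who picks a resource uniformly at random.
    """
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k
    means = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        choice = max(range(k), key=lambda i: ucb1_index(means[i], pulls[i], t))
        opponent = rng.randrange(k)
        reward = 1.0 if rng.random() < true_means[choice] else 0.0
        num_players = 2 if opponent == choice else 1
        total += shared_reward(reward, num_players)
        # Feedback reveals the undivided reward of the chosen resource,
        # so the running mean is updated with that value.
        pulls[choice] += 1
        means[choice] += (reward - means[choice]) / pulls[choice]
    return means, pulls, total
```

Calling `run([0.2, 0.8], 2000)` returns the learnt mean estimates, the per-resource selection counts, and player 1's accumulated share. This sketch maximizes the estimated mean reward only; it does not implement the worst-case objective described in the abstract.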

Similar Papers
  • Conference Article
  • 10.65109/cagk5713
Improving Mobile Maternal and Child Health Care Programs: Collaborative Bandits for Time Slot Selection
  • May 6, 2024
  • Soumyabrata Pal + 4 more

Maternal and child health is a global priority, reflected in the UN Sustainable Development Goal 3.1. Mobile health (mHealth) programs, using automated voice messages, are a vital tool for NGOs to disseminate health information in underserved communities. However, these programs face challenges: limited beneficiary phone access and unknown time preferences hinder timely outreach, leading to poor engagement. We address this by formulating the time preference inference problem as a multi-agent multi-armed bandit optimization problem, where beneficiaries are modeled as agents, and time slots as arms. We introduce a novel online collaborative filtering framework that infers preferred time slots by collaborating across beneficiaries to quickly identify their preferred time slots. To highlight the scope and impact of this problem, we are working with Kilkari, the world's largest maternal and child mHealth program serving millions in India every week. Kilkari faces substantial reattempt costs to improve call answer rates. Through extensive experiments on real-world data obtained from Kilkari, we demonstrate that our collaborative bandit framework significantly outperforms both existing policies used by the NGO, and popular non-collaborative bandit algorithms (e.g., Upper Confidence Bound), both in terms of number of call retries, saving critical bandwidth that enables wider outreach, and by rapidly learning optimal time slots, improving beneficiary engagement and retention.

  • Conference Article
  • 10.1109/msn.2010.25
Delay Analysis for Different Resource Allocation Schemes in Wireless Networks
  • Dec 1, 2010
  • Hongkun Li + 1 more

In this paper, we study the delay performance of wireless networks under different resource allocation schemes with single-hop traffic. Existing works studying the delay performance only consider a given resource allocation scheme, either a multi-channel system (sharing bandwidth) or a time slotted system (sharing time). The fundamental question ignored is which type of resource allocation scheme produces better delay performance under different network configurations, such as the number of commodities and the traffic statistics. We investigate the impact of different resource allocation schemes on the delay performance. A new arrival mode is designed for the time slotted system to reduce the average delay. We also construct the delay lower bound taking the perfect scheduling policy and queue management into account. We draw four important conclusions from the numerical results: 1) the new arrival mode produces better delay performance than the regular mode, and it is immune to changes in the time slot length; 2) the time slotted system has better delay performance than the multi-channel system, and almost achieves the lower bound; 3) the multi-channel system does not scale well, since the delay becomes very large with a large number of commodity flows, whereas the time slotted system is scalable, with the delay converging as the number of commodities grows to infinity; 4) both the multi-channel system and the time slotted system are sensitive to the difference between arrival rate and service rate, meaning that the delay is large when the arrival rate is close to the service rate.

  • Research Article
  • Cite count: 1
  • 10.1051/itmconf/20257301011
Application of Multi-Armed Bandit Algorithm in Quantitative Finance
  • Jan 1, 2025
  • ITM Web of Conferences
  • Chengxun Chen + 3 more

The volatility and diversity of financial markets make it challenging for a single portfolio to achieve better returns; therefore, adjustable portfolios based on the risk tolerance of clients are in high demand. However, traditional portfolio strategies cannot meet this requirement. To address this issue, the paper combines Fuzzy C-means (FCM) with the Upper Confidence Bound (UCB) algorithm, a Genetic Algorithm (GA) optimizing UCB parameters (GA-UCB), and UCB redefining the fitness of GA (UCB-GA) to construct an investment portfolio strategy that can be dynamically adjusted. The research methodology is as follows: the assets are grouped by FCM, and UCB is used to find the best cluster among the groups; UCB, UCB-GA, and GA-UCB are then used to refine the weight distribution of the best cluster. The results show that the cumulative return of the cluster recommended by UCB is significantly higher than that recommended by FCM, the Sortino Ratio is improved by 1.18, and the Maximum Drawdown is reduced by 8%. In terms of the weights of the optimal cluster, the portfolio strategy from GA-UCB has the highest cumulative return among the algorithms, approximately 250%. The Sortino Ratio of GA-UCB is the largest at 3.23, which is 1.5 and 1.63 higher than UCB and UCB-GA, respectively. In addition, the Maximum Drawdown of GA-UCB is 26%, which is 1% lower than UCB-GA and 3% lower than UCB. Combining FCM and GA-UCB can improve investment return and stability by adjusting the portfolio weights, which leads to better return-risk ratios.

  • Conference Article
  • Cite count: 16
  • 10.1109/ijcnn.2014.6889390
The scalarized multi-objective multi-armed bandit problem: An empirical study of its exploration vs. exploitation tradeoff
  • Jul 1, 2014
  • Saba Q Yahyaa + 2 more

The multi-armed bandit (MAB) problem is the simplest sequential decision process with stochastic rewards where an agent chooses repeatedly from different arms to identify as soon as possible the optimal arm, i.e. the one with the highest mean reward. Both the knowledge gradient (KG) policy and the upper confidence bound (UCB) policy work well in practice for the MAB-problem because of a good balance between exploitation and exploration while choosing arms. In case of the multi-objective MAB (or MOMAB)-problem, arms generate a vector of rewards, one per arm, instead of a single scalar reward. In this paper, we extend the KG-policy to address multi-objective problems using scalarization functions that transform reward vectors into a single scalar reward. We consider different scalarization functions and we call the corresponding class of algorithms scalarized KG. We compare the resulting algorithms with the corresponding variants of the multi-objective UCB1-policy (MO-UCB1) on a number of MOMAB-problems where the reward vectors are drawn from a multivariate normal distribution. We compare experimentally the exploration versus exploitation trade-off and we conclude that scalarized-KG outperforms MO-UCB1 on these test problems.

  • Book Chapter
  • Cite count: 1
  • 10.1007/978-3-030-21077-9_7
Some Variations of Upper Confidence Bound for General Game Playing
  • Jan 1, 2019
  • Iván Francisco-Valencia + 2 more

Monte Carlo Tree Search (MCTS) is the most widely used method in General Game Playing, an area of Artificial Intelligence whose main goal is to develop agents capable of playing any board game without prior knowledge. MCTS requires a tree that represents the states and moves of the board game, which is visited and expanded iteratively. To visit the tree, MCTS requires a selection policy that determines which node is visited at each level. Nowadays, Upper Confidence Bound (UCB) is the most popular policy in MCTS due to its simplicity and efficiency. This policy was proposed for the Multi-Armed Bandit Problem (MABP), which consists of a set of slot machines, each of which has a certain probability of giving a reward. The goal is to maximize the cumulative reward obtained when a machine is played over a series of rounds. Another policy proposed for MCTS is Upper Confidence Bound\(_{\sqrt{.}}\) (UCB\(_{\sqrt{.}}\)), whose goal is to identify the machine with the highest probability of giving a reward. This paper presents a comparison between five modifications of UCB and one of UCB\(_{\sqrt{.}}\). The comparison aims to find a policy that is able to identify the optimal machine as quickly as possible; in MCTS, this goal is equivalent to identifying the node with the highest probability of leading to a victory. The results show that some policies find the optimal machine before UCB; however, with 10,000 rounds UCB is the policy that plays the optimal machine most often.

  • Research Article
  • Cite count: 5
  • 10.1186/s13638-020-01738-w
Distributed algorithm under cooperative or competitive priority users in cognitive networks
  • Jul 8, 2020
  • EURASIP Journal on Wireless Communications and Networking
  • Mahmoud Almasri + 5 more

Opportunistic spectrum access (OSA) problem in cognitive radio (CR) networks allows a secondary (unlicensed) user (SU) to access a vacant channel allocated to a primary (licensed) user (PU). By finding the availability of the best channel, i.e., the channel that has the highest availability probability, a SU can increase its transmission time and rate. To maximize the transmission opportunities of a SU, various learning algorithms are suggested: Thompson sampling (TS), upper confidence bound (UCB), ε-greedy, etc. In our study, we propose a modified UCB version called AUCB (Arctan-UCB) that can achieve a logarithmic regret similar to TS or UCB while further reducing the total regret, defined as the reward loss resulting from the selection of non-optimal channels. To evaluate AUCB’s performance for the multi-user case, we propose a novel uncooperative policy for a priority access where the kth user should access the kth best channel. This manuscript theoretically establishes the upper bound on the sum regret of AUCB under the single or multi-user cases. The users thus may, after finite time slots, converge to their dedicated channels. It also focuses on the Quality of Service AUCB (QoS-AUCB) using the proposed policy for the priority access. Our simulations corroborate AUCB’s performance compared to TS or UCB.

  • Research Article
  • Cite count: 18
  • 10.1109/tsp.2018.2870383
Regional Multi-Armed Bandits With Partial Informativeness
  • Nov 1, 2018
  • IEEE Transactions on Signal Processing
  • Zhiyang Wang + 2 more

We consider a variant of the classic multi-armed bandit problem where the expected reward of each arm is a function of an unknown parameter. The arms are divided into different groups, each of which has a common parameter. Therefore, when the player selects an arm at each time slot, information of other arms in the same group is also revealed. This regional bandit model naturally bridges the classical non-informative bandit setting where the player can only learn the chosen arm, and the global bandit model where sampling one arm reveals information of all arms. We propose an efficient algorithm, UCB-g, that solves the regional bandit model by combining the Upper Confidence Bound (UCB) and greedy principles. Both parameter-dependent and parameter-free regret upper bounds are derived. We also establish a matching lower bound, which proves the order optimality of UCB-g. Moreover, we propose SW-UCB-g, which is an extension of UCB-g for a non-stationary environment where the parameters vary over time.

  • Conference Article
  • Cite count: 4
  • 10.1109/wowmom54355.2022.00029
Learning the Optimal Controller Placement in Mobile Software-Defined Networks
  • Jun 1, 2022
  • Iordanis Koutsopoulos

We formulate and study the problem of online learning of the optimal controller selection policy in mobile Software-Defined Networks, where the controller-switch round-trip-time (RTT) delays are unknown and time-varying. Static optimization approaches are not helpful, since delays vary significantly (and sometimes, arbitrarily) from one slot to another, and only RTT delays from the current active controller can be easily measured. First, we model the sequence of RTT delays across time as a stationary random process so that the value at each time slot is a sample from an unknown probability distribution with unknown mean. This approach is applicable in relatively static network settings, where stationarity can be assumed. We cast the problem as a stochastic multiarmed bandit, where the arms are the different controller choices, and we fit different bandit algorithms to that setting, such as: the Lowest Confidence Bound (LCB) algorithm by modifying the known Upper Confidence Bound (UCB) one, the LCB-tuned one, and the Boltzmann exploration one. The first two are known to achieve sublinear regret, while the last one turns out to be very efficient. In a second approach, the random process of RTTs is non-stationary and thus cannot be characterized statistically. This scenario is applicable in cases of arbitrary mobility and other dynamics that affect RTT delays in an unpredictable, adversarial manner. We pose the problem as an adversarial bandit that can be solved with the EXP3 algorithm which achieves sublinear regret. We argue that all approaches can be implemented in an SDN environment with lightweight messaging. We also compare the performance of these algorithms for different problem settings and hyper-parameters that reflect the efficiency of the learning process. Numerical evaluation shows that Boltzmann exploration achieves the best performance.
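
For the adversarial case mentioned above, a minimal sketch of the textbook EXP3 update may help make the idea concrete. It is not the paper's implementation. It assumes rewards already scaled to [0, 1] (for delays, something like one minus a normalized RTT), and the names `exp3_probabilities`, `exp3`, `reward_fn`, and the value `gamma=0.1` are illustrative choices.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration of rate gamma."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def exp3(reward_fn, k, horizon, gamma=0.1, seed=0):
    """Run EXP3 for `horizon` slots and return how often each arm was played.

    `reward_fn(t, arm)` is a hypothetical callback returning a reward in [0, 1].
    """
    rng = random.Random(seed)
    weights = [1.0] * k
    counts = [0] * k
    for t in range(horizon):
        probs = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)
        # Importance-weighted estimate keeps the reward estimate unbiased
        # even though only the played arm is observed.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)
        # Rescaling leaves the probabilities unchanged and avoids overflow.
        top = max(weights)
        weights = [w / top for w in weights]
        counts[arm] += 1
    return counts
```

With a reward function that always favors one controller, for example `exp3(lambda t, arm: 1.0 if arm == 2 else 0.0, k=3, horizon=2000)`, the play counts concentrate on that controller while the uniform mixing keeps sampling the others.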

  • Conference Article
  • Cite count: 1
  • 10.1109/iswcs.2018.8491080
Optimal Compression and Transmission Policies for Energy Harvesting Nodes
  • Aug 1, 2018
  • Hamed Mirghasemi + 2 more

We consider an energy harvesting transmitter which may need to compress received packets before forwarding them over a flat fading channel. Data compression is required to meet the bandwidth or energy constraint at the cost of data distortion. The objective is to design optimal compression and transmission policies, namely optimal transmission and compression powers, transmission and compression rates and transmission and compression times, such that the total distortion is minimized. In this paper, we consider a time slotted system where new data and energy packets arrive at the beginning of each time slot (TS) and channel gains are assumed to remain constant during each TS. Under the assumption that the energy and data arrivals and channel gains are known non-causally which corresponds to offline optimization, we formulate the compression and transmission scheduling optimization as a convex optimization problem and characterize the properties of optimal scheduling. For the strict delay case where the transmission and compression of each packet must be executed within the corresponding TS, we provide an iterative algorithm which mimics the iterative directional water-filling (IDWF) algorithm. Numerical results are provided to illustrate our results and the properties of optimal scheduling.

  • Conference Article
  • Cite count: 6
  • 10.1109/icassp.2017.7952855
Enhancing QoS in spatially controlled beamforming networks via distributed stochastic programming
  • Mar 1, 2017
  • Dionysios S Kalogerias + 1 more

We address the problem of enhancing Quality-of-Service (QoS) in power constrained, mobile relay beamforming networks, by controlling the motion of the relaying nodes. We consider a time slotted system, where the relays update their positions before the beginning of each time slot. Adopting a spatiotemporal stochastic field model of the wireless channel, we propose a novel 2-stage stochastic programming formulation for specifying the relay positions at each time slot, such that the QoS of the network is maximized on average, based on causal Channel State Information (CSI) and under a total relay transmit power budget. Via the Method of Statistical Differentials, the motion control problem considered is shown to be approximately equivalent to a set of simple subproblems, which are solved in a distributed fashion, one at each relay. Numerical simulations are also presented, corroborating the efficacy of the proposed approach.

  • Conference Article
  • 10.1109/icspcc.2017.8242633
Keynote speaker 1: Enhancing QoS in beamforming networks: Mobile beamformers and optimal motion policies
  • Oct 1, 2017
  • Athina P Petropulu

Distributed, networked communication systems, such as relay beamforming networks are typically designed without considering how the positions of the respective nodes might affect the quality of the communication. That is, network nodes are either assumed to be stationary in space, or, if some of them are moving while communicating, their trajectories are assumed to be independent of the respective communication task. However, in most cases, the Channel State Information (CSI) observed by each network node, per channel use is both spatially and temporally correlated. One could then ask whether the performance of the communication system could be improved by predictively controlling the positions of the network nodes (e.g., the relays), based on causal CSI estimates and by exploiting the spatiotemporal dependencies of the communication medium. In this talk, we address the problem of enhancing Quality-of-Service (QoS) in power constrained, mobile relay beamforming networks, by optimally exploiting relay mobility. We consider a time slotted system, where the relays update their positions before the beginning of each time slot. Adopting a spatiotemporal stochastic field model of the wireless channel, we propose a novel 2-stage stochastic programming formulation for specifying the relay positions at each time slot, such that the QoS of the network is maximized on average, based on causal CSI and under a total relay transmit power budget. The motion control problem considered is shown to be approximately equivalent to a set of simple subproblems, which can be solved in a distributed fashion, one at each relay. Numerical simulations are presented, corroborating the efficacy of the proposed approach and confirming its properties.

  • Conference Article
  • Cite count: 10
  • 10.1109/isit.2015.7282633
Distributed compression and transmission with energy harvesting sensors
  • Jun 1, 2015
  • Rajeev Gangula + 2 more

We determine the achievable distortion region when the correlated source samples are transmitted by two energy harvesting (EH) sensor nodes to the destination over orthogonal fading channels. A time slotted system is considered in which the energy and the source samples arrive at the beginning of each time slot (TS), and both the correlation between source samples at the two nodes and fading coefficients change over time but remain constant in each TS. Assuming non-causal knowledge of these time-varying source statistics, energy arrivals and the channel gains, i.e., under the offline optimization framework, we obtain the optimal transmission and coding schemes that achieve the points on the Pareto boundary of the total distortion region. An iterative directional 2D waterfilling algorithm is proposed to obtain two specific points on this boundary.

  • Research Article
  • Cite count: 12
  • 10.1145/3447380
Federated Bandit
  • Feb 18, 2021
  • Proceedings of the ACM on Measurement and Analysis of Computing Systems
  • Zhaowei Zhu + 3 more

In this paper, we study Federated Bandit, a decentralized Multi-Armed Bandit problem with a set of $N$ agents, who can only communicate their local data with neighbors described by a connected graph $G$. Each agent makes a sequence of decisions on selecting an arm from $M$ candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with which agents will never share their local observations with a central entity, and will be allowed to only share a private copy of their own information with their neighbors. We first propose a decentralized bandit algorithm Gossip_UCB, which is a coupling of variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that Gossip_UCB successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves guaranteed regret at the order of $O(\max\{\mathrm{poly}(N,M)\log T,\ \mathrm{poly}(N,M)\log_{\lambda_2^{-1}} N\})$ for all $N$ agents, where $\lambda_2\in(0,1)$ is the second largest eigenvalue of the expected gossip matrix, which is a function of $G$. We then propose Fed_UCB, a differentially private version of Gossip_UCB, in which the agents preserve $\epsilon$-differential privacy of their local data while achieving $O(\max\{\frac{\mathrm{poly}(N,M)}{\epsilon}\log^{2.5} T,\ \mathrm{poly}(N,M)(\log_{\lambda_2^{-1}} N + \log T)\})$ regret.

  • Conference Article
  • Cite count: 8
  • 10.1109/cec.2016.7744315
Product selection based on upper confidence bound MOEA/D-DRA for testing software product lines
  • Jul 1, 2016
  • Thiago Do Nascimento Ferreira + 3 more

The selection of products for testing Software Product Lines (SPLs) is an optimization problem. The goal is to select a minimum possible set of products that satisfies testing criteria such as pairwise and mutation testing. Multi-objective Evolutionary Algorithms (MOEAs) have been successfully used to solve this problem and others related to software development. However, the use of MOEAs demands setting a number of control parameters and selecting genetic operators, to which the algorithm performance is often very sensitive. Adaptive Operator Selection (AOS) methods, such as those based on Upper Confidence Bound (UCB), can help in this task. UCB methods used with Multi-objective Evolutionary Algorithm Based on Decomposition with Dynamical Resource Allocation (MOEA/D-DRA) have presented promising results, but they are underexplored in the Search Based Software Engineering (SBSE) field. To contribute to this research area and to solve the product selection problem efficiently, this paper investigates the use of different AOS UCB-based methods with MOEA/D-DRA. The idea is to reduce the effort spent by the tester, since some parameters and evolutionary operators can be set automatically. The approach is empirically evaluated using four instances and three UCB methods. The UCB methods present similar results and outperform the canonical version of MOEA/D-DRA.

  • Research Article
  • 10.54254/2755-2721/83/2024glg0077
Applications and Advances of UCB Algorithms in Dynamic and Contaminated Environments
  • Oct 31, 2024
  • Applied and Computational Engineering
  • Mengxuan Du

The Upper Confidence Bound (UCB) algorithm is a widely used approach in the Multi-Armed Bandit (MAB) problem, where the goal is to maximize cumulative rewards over time by selecting the best possible action among several options. The UCB algorithm uses confidence bounds to balance exploration and exploitation, guiding its decision-making. In recent years, researchers have identified significant challenges in applying the UCB algorithm to dynamic and contaminated environments. In such scenarios, the underlying conditions may change over time, making it difficult for standard UCB to adapt, or the data may be polluted by noise and outliers, leading to incorrect estimations of reward distributions. To address these challenges, several variants of the UCB algorithm have been developed. These new approaches are designed to better handle the complexities of changing environments and data contamination, ensuring more robust and reliable performance in these difficult settings. This paper aims to provide a comprehensive review of Robust-UCB (cr-UCB), Sliding Window UCB (SW-UCB) and bandit-over-bandit UCB (BOB-UCB), focusing on their theoretical foundations, practical applications, and empirical performance. By examining how these algorithms have been adapted to handle the complexities of dynamic and contaminated environments, we found that the adaptability of these algorithms in dynamic environments is significantly improved, and they can effectively reduce decision-making errors caused by data pollution, thus providing a more reliable solution to the multi-armed bandit problem.
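
To illustrate how a sliding window lets UCB track a changing environment, here is a rough sketch of the general idea behind SW-UCB, not the exact variant reviewed in the paper. Statistics are computed only over the most recent `window` slots, so stale observations are forgotten. The class name `SlidingWindowUCB`, the bonus constant `xi=2.0`, and the window length are assumptions.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """UCB index computed only from the last `window` (arm, reward) pairs."""

    def __init__(self, k, window, xi=2.0):
        self.k = k
        self.window = window
        self.xi = xi  # exploration constant (assumed value)
        self.history = deque()

    def select(self):
        counts = [0] * self.k
        sums = [0.0] * self.k
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Any arm with no sample left in the window is replayed first.
        for arm in range(self.k):
            if counts[arm] == 0:
                return arm
        n = len(self.history)
        return max(
            range(self.k),
            key=lambda a: sums[a] / counts[a]
            + math.sqrt(self.xi * math.log(n) / counts[a]),
        )

    def update(self, arm, reward):
        self.history.append((arm, reward))
        if len(self.history) > self.window:
            self.history.popleft()  # forget the oldest observation


def simulate(horizon=1000, change_at=500, window=100):
    """Two arms whose deterministic rewards swap at slot `change_at`."""
    policy = SlidingWindowUCB(k=2, window=window)
    plays = []
    for t in range(horizon):
        arm = policy.select()
        best = 0 if t < change_at else 1
        policy.update(arm, 1.0 if arm == best else 0.0)
        plays.append(arm)
    return plays
```

In `simulate()`, the policy favors arm 0 before the change and shifts to arm 1 once the window has flushed the outdated rewards. A standard UCB that averages over the whole history would need far longer to react.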
