Articles published on Multi-armed bandit
2626 Search results
- Research Article
- 10.1007/s10732-025-09578-x
- Jan 14, 2026
- Journal of Heuristics
- Long Wang + 3 more
Multi-armed Bandit and Backbone boost Lin-Kernighan-Helsgaun Algorithm for the Traveling Salesman Problem and its Variants
- Research Article
- 10.1080/0305215x.2025.2609847
- Jan 9, 2026
- Engineering Optimization
- Yuehong Sun + 3 more
In real-world optimization problems, it is crucial to effectively exploit the synergies among tasks. Recent advances in evolutionary many-task optimization have highlighted adaptive knowledge transfer, particularly concerning transfer frequency and auxiliary task selection. However, the prevailing methods have some limitations, including complex parameter settings for auxiliary task selection and inaccurate population-based task similarity measurement. This article proposes an adaptive many-task optimization algorithm with dual knowledge transfer (MTO-ADKT). First, an enhanced multi-armed bandit mechanism automates the selection of source tasks using a minimum of parameters. Second, the timing of knowledge transfer is optimized through dynamic adjustment of the transfer frequency. Third, an elite-driven adaptation strategy enhances the precision of knowledge transfer. Through extensive experimentation, the effectiveness of MTO-ADKT was validated on four benchmark test sets with 2, 10, and 50 tasks, and on a planar manipulator control problem with 500 tasks, demonstrating performance superior to seven state-of-the-art algorithms.
- Research Article
- 10.1007/s00530-025-02115-7
- Jan 3, 2026
- Multimedia Systems
- Chaofeng Li + 2 more
FedMAB: adaptive multimodal federated learning with multi-armed bandits
- Research Article
- 10.1016/j.jbi.2026.104987
- Jan 1, 2026
- Journal of biomedical informatics
- Xenia Konti + 7 more
A federated learning framework for ethical dynamic treatment allocation across heterogeneous hospitals.
- Research Article
- 10.1109/tccn.2025.3602862
- Jan 1, 2026
- IEEE Transactions on Cognitive Communications and Networking
- Huibin Liang + 4 more
Joint Two Stage Beamforming and User Scheduling for LEO Satellite Communications via Hierarchical Multi-Armed Bandit
- Research Article
- 10.1016/j.engappai.2025.113088
- Jan 1, 2026
- Engineering Applications of Artificial Intelligence
- Sharyal Zafar + 4 more
Decentralized multi-agent multi-armed bandits for smart electric vehicles charging
- Research Article
- 10.54254/2755-2721/2026.tj30959
- Dec 31, 2025
- Applied and Computational Engineering
- Yang Gou
South Korea is facing a significant low-fertility rate issue, with varying success in fertility policy outcomes between urban and rural areas. The traditional fixed-area pilot model struggles to adapt to non-stationary fluctuations in fertility rates, leading to high trial-and-error costs. This study addresses the optimization of urban-rural fertility policies by proposing an enhanced Sliding Window Upper Confidence Bound (SW-UCB) algorithm that combines a sliding window with a forgetting factor. It treats Seoul and South Jeolla Province as arms of a Multi-Armed Bandit model, defining the increase in fertility rate per unit subsidy as the reward and conducting a simulated 60-month pilot. The improved algorithm demonstrates a 22.3% reduction in cumulative regret compared to the traditional UCB algorithm and a 19.4% reduction compared to Thompson Sampling, effectively accommodating fluctuations in fertility rates and aiding the precise adjustment of policies.
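The abstract's enhanced SW-UCB variant is not reproduced here; the sketch below illustrates only the generic idea of a sliding-window UCB combined with a forgetting factor, assuming rewards scaled to [0, 1]. The names (`sw_ucb`, `reward_fn`) and all parameter values are illustrative, not the authors' implementation.

```python
import math
import random
from collections import deque

def sw_ucb(reward_fn, n_arms, horizon, window=50, gamma=0.99, c=1.0, seed=0):
    """Generic sliding-window UCB sketch with a forgetting factor:
    statistics use only the last `window` pulls, and older pulls inside
    the window are further down-weighted by gamma, so the policy can
    track non-stationary reward rates (here, fertility-rate gains)."""
    rng = random.Random(seed)
    history = deque(maxlen=window)   # (arm, reward) pairs inside the window
    total = 0.0
    for t in range(horizon):
        counts = [0.0] * n_arms      # discounted pull counts
        sums = [0.0] * n_arms        # discounted reward sums
        for age, (a, r) in enumerate(reversed(history)):
            w = gamma ** age
            counts[a] += w
            sums[a] += w * r
        if any(n == 0.0 for n in counts):
            arm = counts.index(0.0)  # keep every arm represented in the window
        else:
            w_total = sum(counts)
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(c * math.log(w_total) / counts[a]))
        reward = reward_fn(arm, t, rng)
        history.append((arm, reward))
        total += reward
    return total
```

Because old observations age out of the window (and decay inside it), a previously inferior arm whose reward rate drifts upward is rediscovered quickly, which is the property the cumulative-regret comparison above exploits.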
- Research Article
- 10.1007/s10994-025-06943-6
- Dec 25, 2025
- Machine Learning
- Keqin Liu + 2 more
Extended UCB Policies for Multi-armed Bandit Problems
- Research Article
- 10.1080/10543406.2025.2602463
- Dec 22, 2025
- Journal of Biopharmaceutical Statistics
- Masahiro Kojima + 1 more
In cancer phase I dose-finding clinical trials, following the Project Optimus of the U.S. Food and Drug Administration, there is an increasing need to add backfill cohorts not only to determine the maximum tolerated dose but also to identify the optimal effective dose for subsequent Phase II trials. In this study, multi-armed bandit algorithms, known for their superiority in exploratory selection, are applied to the problem of selecting which dose levels to add backfill cohorts. Additionally, we propose a method that incorporates efficacy modeling into the multi-armed bandit algorithm. The multi-armed bandit approach is straightforward, requiring only simple calculations, making it highly accessible. To enhance understanding of backfill cohort addition using multi-armed bandit algorithms, we provide demonstrations and evaluate the performance through simulations.
- Research Article
- 10.1103/zxx8-j2d4
- Dec 19, 2025
- Physical Review A
- Honoka Shiratori + 5 more
Quantum computing has the potential to solve complex problems faster and more efficiently than classical computing. It can achieve speedups by leveraging quantum phenomena like superposition, entanglement, and tunneling. Quantum walks (QWs) form the foundation for many quantum algorithms. Unlike classical random walks, QWs exhibit quantum interference, leading to unique behaviors such as linear spreading and localization. These properties make QWs valuable for various applications, including universal computation, time series prediction, encryption, and quantum hash functions. One emerging application of QWs is decision-making. Previous research has used QWs to model human decision processes and solve multiarmed bandit problems. This paper extends QWs to collective decision-making, focusing on minimizing decision conflicts, cases where multiple agents choose the same option, leading to inefficiencies like traffic congestion or overloaded servers. Prior research using quantum interference has addressed two-player conflict avoidance but struggled with three-player scenarios. This paper proposes a method using QWs to entirely eliminate decision conflicts in three-player cases, demonstrating its effectiveness in collective decision-making.
- Research Article
- 10.1021/acs.jctc.5c01431
- Dec 16, 2025
- Journal of chemical theory and computation
- Qia Ke + 4 more
For applications in gas sensing, purification, and capture, we often wish to search a large set of metal-organic frameworks (MOFs) for the top-K in terms of their Henry coefficients for an adsorbate. A molecular simulation to predict the Henry coefficient of a MOF constitutes a Monte Carlo integration where each sample consists of inserting an adsorbate in the MOF at a random position, orientation, and configuration, then calculating the MOF-adsorbate interaction energy. Our idea is to leverage top-K arm identification algorithms, developed for the multi-armed bandit problem in reinforcement learning, to sequentially and adaptively allocate adsorbate insertions among the MOFs, in a data-driven manner, to obtain the most accurate top-K subset under a fixed insertion budget. By analogy, each MOF is a slot machine in a casino that, upon pulling its arm (inserting an adsorbate), offers a stochastic reward (a noisy estimate of its Henry coefficient) sampled from a static, unknown probability distribution. Each adaptive allocation algorithm (1) proceeds in a feedback loop of (i) allocate adsorbate insertions to MOF(s), (ii) update the running estimates of the Henry coefficients of the MOF(s), then (iii) judiciously allocate adsorbate insertions to the next MOF(s); (2) sequentially dials up the fidelities of ongoing molecular simulations in the MOFs, giving a multifidelity computational screening; and (3) circumvents the need to hand-craft structural or chemical features of the MOFs for decision making. As a case study, we implement, benchmark, and analyze the sequential halving, successive accepts and rejects, and narrowing exploration (our proposed heuristic) algorithms to adaptively allocate xenon insertions to screen a set of ca. 300 MOFs for the top-K Xe Henry coefficient subset over differing insertion budgets.
Provided with a sufficient budget, we find that these adaptive insertion algorithms can significantly reduce (by a factor of 2-3) the simple regret (the sum of the true top-K Henry coefficients minus those of the empirically selected subset) and the error in the top-K subset of MOFs output by a computational screening. By another metric, adaptive insertion allocation provided a ca. 60% discount on the computational cost of identifying the top-K MOFs with less than 5% error. We thereby demonstrate that top-K arm identification algorithms may be generally useful for more efficiently screening materials for various properties via Monte Carlo molecular simulations. This efficiency improvement is especially important when adopting more computationally expensive, sophisticated force fields or even ab initio calculations for the potential energy of configurations to lend higher-fidelity screenings.
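Of the algorithms named above, sequential halving is the most compact to sketch. The following is a generic illustration, not the authors' code: `pull` is a hypothetical stand-in for one Monte Carlo insertion returning a noisy estimate of a MOF's Henry coefficient, and the budget split is a common textbook choice.

```python
import math
import random

def sequential_halving_topk(pull, n_arms, k, budget, seed=0):
    """Sequential-halving sketch for top-K identification under a fixed
    sampling budget: split each round's budget evenly over the surviving
    arms, then discard the worse-scoring half until only K remain."""
    rng = random.Random(seed)
    survivors = list(range(n_arms))
    rounds = max(1, math.ceil(math.log2(n_arms / k)))
    per_round = budget // rounds
    est = [0.0] * n_arms   # running mean estimate per arm
    cnt = [0] * n_arms     # samples drawn per arm so far
    for _ in range(rounds):
        pulls_each = max(1, per_round // len(survivors))
        for a in survivors:
            for _ in range(pulls_each):
                cnt[a] += 1
                est[a] += (pull(a, rng) - est[a]) / cnt[a]  # update mean
        survivors.sort(key=lambda a: est[a], reverse=True)
        survivors = survivors[:max(k, len(survivors) // 2)]
        if len(survivors) == k:
            break
    return sorted(survivors)
```

Each halving round spends the same total budget on half as many arms, so the surviving candidates accumulate progressively tighter estimates, which is the adaptive-allocation effect the abstract describes.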
- Research Article
- 10.52783/jisem.v10i63s.13854
- Dec 13, 2025
- Journal of Information Systems Engineering and Management
- Prathyusha Bhaskar Karnam
To stay competitive in e-commerce, retailers must continually introduce new website features to enhance the user experience and improve operational performance. A/B testing can identify whether these features have a positive impact: retailers run a controlled experiment that statistically compares a group of users who experience the new features against a group who do not. When running a test, a retailer must state a hypothesis, then measure and analyze a sample of users with statistical methods (for example, by setting a sample size, choosing metrics, and determining randomization procedures) to find statistically significant evidence of whether the result is a real improvement or merely random chance. Moreover, failing to account for common pitfalls of experimental design, such as the novelty effect, selection bias, or multiple testing under the same conditions, can invalidate the results. Traditional A/B testing is a reasonable tool for most experimental needs; however, advanced experimentation techniques provide additional capabilities in more complex scenarios. Multi-armed bandit algorithms can dynamically optimize the allocation of web traffic to different versions of a webpage while minimizing the cost of directing users to suboptimal experiences. Lastly, causal inference methods enable businesses to measure the impact of website changes without randomization, allowing them to evaluate effectiveness across the entire platform. "Trustworthy experimentation" combines a solid foundation of statistical methodology with experience gained from actual business decisions to help businesses learn and iterate more rapidly while reducing risk. The principles of valid experimentation provide a framework for implementing valid experiments that support evidence-based decision-making in e-commerce.
- Research Article
- 10.1038/s41598-025-31829-x
- Dec 8, 2025
- Scientific Reports
- Jiujia Zhang + 11 more
Community-based tuberculosis screening using mobile X-ray units can effectively increase case detection rates by reducing barriers to accessing services. This study evaluated the multi-armed bandit (MAB) framework, a machine learning approach, for optimizing mobile screening locations. Using simulations, we compared two MAB algorithms, Exp3 and LinUCB, with strategies based on historical case rates and random placement. The MAB algorithms continually updated site selection based on observed screening yields, and LinUCB additionally incorporated local socioeconomic indicators associated with tuberculosis rates. Over three years, assuming two mobile units serving 95 sites in Lima, Peru, 1,000 simulations demonstrated that the MAB algorithms significantly reduced the average number of screenings needed to detect one individual with tuberculosis: 112 (standard deviation [SD]: 10) for Exp3 and 79 (SD: 12) for LinUCB, versus 152 (SD: 11) for random placement and 143 (SD: 11) for historic case-rate-driven placement. LinUCB performed best, achieving a 20% increase in detection efficiency by week 16 and 50% by week 40 compared to case-rate-driven placement. Overall, both MAB algorithms improved tuberculosis screening yields, emphasizing the value of data-driven approaches for optimizing mobile screening interventions. Incorporating adaptive models into screening programs may enhance targeting efficiency and offers a promising direction for policymakers and implementers seeking to optimize resource allocation in high-burden settings.
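Of the two algorithms compared above, Exp3 is the simpler to sketch. This is a minimal, generic illustration of Exp3 applied to the site-selection analogy, not the study's implementation: `reward_fn` is a hypothetical stand-in for the observed screening yield, assumed rescaled to [0, 1], and the learning rate is illustrative.

```python
import math
import random

def exp3(reward_fn, n_sites, horizon, gamma=0.1, seed=0):
    """Exp3 sketch for adaptive site selection: keep one weight per
    site, sample from a gamma-smoothed distribution, and exponentially
    up-weight sites that return high screening yield."""
    rng = random.Random(seed)
    w = [1.0] * n_sites
    total = 0.0
    for _ in range(horizon):
        s = sum(w)
        probs = [(1 - gamma) * wi / s + gamma / n_sites for wi in w]
        u, acc, site = rng.random(), 0.0, n_sites - 1
        for i, p in enumerate(probs):   # inverse-CDF sampling of a site
            acc += p
            if u < acc:
                site = i
                break
        r = reward_fn(site, rng)        # observed yield in [0, 1]
        # importance-weighted exponential update for the chosen site
        w[site] *= math.exp(gamma * r / (probs[site] * n_sites))
        m = max(w)
        w = [wi / m for wi in w]        # renormalize to avoid overflow
        total += r
    return total
```

Unlike LinUCB, this sketch uses no side information; the study's LinUCB arm additionally conditions each site's estimate on local socioeconomic covariates.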
- Research Article
- 10.64898/2025.12.03.692237
- Dec 8, 2025
- bioRxiv : the preprint server for biology
- Meghan E Gallo + 10 more
Early life adversity (ELA) confers risk for reward-related psychopathologies. These risks may stem from adaptations optimizing reward pursuit in anticipation of unreliable, resource poor environments. One rational adaptation to poor, unreliable environments is Behavioral Opportunism: updating expectations more slowly and acting vigorously only when reward is immediately available. To systematically test the impact of ELA on behavioral strategies and underlying reward processing mechanisms, we exposed mice to resource restriction (limited bedding and nesting materials for 7 days) to manipulate the reliability and quality of early life care. Subsequently, we tested adults' reward learning and decision making in a two-arm bandit task and recorded dopamine signaling using dLight1.2 fiber photometry in the nucleus accumbens core. Exposure to ELA led to poorer choice discrimination, impaired learning, and decreased adaptation to changes in reward availability. Furthermore, ELA mice were slower to choose between levers but were faster to retrieve immediately available rewards when delivered, consistent with a strategy of behavioral opportunism. Dopamine signaling predicted behavior in both rearing conditions, and its fluctuations were strongly predictive of faster retrieval in ELA mice and an increased likelihood of choice repetition, implying that aberrant dopamine signals underlie slowed learning and vigorous action for immediately available rewards. To understand key features of maternal interactions driving these effects, we used home cage video monitoring to quantify maternal behaviors, continuously, across early life. We found that specific experiential outcomes, such as maternal kicking, intensified behavioral opportunism in adults, predicting poorer bandit task performance beyond the group effect of ELA. Behavioral opportunism provides an explanatory framework for interpreting altered reward processing and reward pursuit in adulthood for individuals exposed to ELA.
- Research Article
- 10.3390/electronics14244805
- Dec 6, 2025
- Electronics
- Jing Gao + 7 more
Unmanned aerial vehicles (UAVs) equipped with antenna arrays can deliver high-capacity, high-throughput, and low-latency communication services. Considering a UAV-assisted mmWave multiple-input multiple-output (MIMO) system, a two-stage beamforming scheme based on a budgeted combinatorial multi-armed bandit (BC-MAB) is proposed to improve the system’s spectral efficiency (SE). The pre-beamformer design problem is initially formulated as a BC-MAB problem. In this framework, the reward is the received energy, the cost corresponds to the energy consumed by each RF chain, and the budget is represented by the residual energy of the UAV. To achieve a favorable trade-off between the number of communication slots and the energy acquired per slot, a pre-beamforming scheme based on the bang-per-buck ratio is introduced to optimize the number of activated RF chains, thereby maximizing the cumulative reward. The second stage utilizes the reduced-dimensional instantaneous channel state information to design and optimize the beamformer for maximum system SE. The proposed scheme achieves more than a 7.1% improvement in SE compared to the benchmark schemes. Simulations validate the superiority of the proposed scheme.
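The BC-MAB formulation above pairs each arm with an energy cost and a shared budget. Below is a generic sketch of a bang-per-buck UCB rule under those assumptions (known per-arm costs, stationary rewards); `pull`, the arm set, and all constants are hypothetical, not the paper's scheme.

```python
import math
import random

def budgeted_ucb(pull, n_arms, costs, budget, seed=0):
    """Budgeted-bandit sketch in the spirit of a bang-per-buck rule:
    each arm (e.g. a number of activated RF chains) has a known cost;
    play the arm with the best upper confidence bound on its
    reward-to-cost ratio until the budget (residual energy) runs out."""
    rng = random.Random(seed)
    counts, sums = [0] * n_arms, [0.0] * n_arms
    spent, total, t = 0.0, 0.0, 0
    while spent + min(costs) <= budget:
        t += 1
        if 0 in counts:
            arm = counts.index(0)      # try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: (sums[a] / counts[a]
                                     + math.sqrt(2 * math.log(t) / counts[a]))
                      / costs[a])      # UCB on reward divided by cost
        if spent + costs[arm] > budget:
            break                      # cannot afford the chosen arm
        r = pull(arm, rng)
        counts[arm] += 1
        sums[arm] += r
        spent += costs[arm]
        total += r
    return total, spent
```

Dividing the reward UCB by the arm's cost steers pulls toward arms with the best energy efficiency rather than the highest raw reward, which mirrors the trade-off between slots and energy per slot described above.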
- Research Article
- 10.54254/2755-2721/2025.ld30217
- Dec 3, 2025
- Applied and Computational Engineering
- Xiwen Guo
Personalized medicine and adaptive clinical trials aim to match optimal treatment plans to individual patients while validating new therapies. Traditional fixed-design trials have limitations, including resource wastage and scarce data. Artificial intelligence has led to the development of dynamic decision algorithms such as the multi-armed bandit (MAB) algorithm, which balances exploration and exploitation in treatment allocation. Researchers aim to integrate MAB with clinical needs while adhering to ethical guidelines. This paper discusses the use of MAB algorithms in medicine, highlighting their potential for optimizing treatment allocation and improving patient outcomes despite ethical constraints and limited data, and their application in contextual and reinforcement learning settings. The research highlights key clinical applications such as adaptive dose-finding, personalized treatment selection, and digital health interventions, supported by both trial-based data and large-scale public datasets like MIMIC-III. Simulation studies are also discussed as a necessary complement to real-world data, facilitating algorithm validation under ethical and logistical constraints. A comparative evaluation of algorithms demonstrates that Bayesian methods, particularly Thompson Sampling and contextual bandits, often provide a more robust balance between efficiency and safety. However, challenges remain in scalability, interpretability, and regulatory acceptance. The paper concludes by identifying promising directions for future research, including the integration of deep reinforcement learning and causal inference, which may further enhance the role of MABs in advancing personalized medicine and adaptive clinical trial design.
- Research Article
- 10.54254/2755-2721/2025.ld30166
- Dec 3, 2025
- Applied and Computational Engineering
- Haokai Tang
The exploration-exploitation dilemma in multi-armed bandit (MAB) problems has long been a classic challenge and serves as a foundation of reinforcement learning, with applications in industries such as online advertising, A/B testing, and clinical medicine. There are many MAB algorithms, each with its own advantages and disadvantages. This paper analyzes the performance of three classic MAB algorithms: the simple and effective ε-Greedy; the Upper Confidence Bound algorithm (UCB1), which is optimistic in the face of uncertainty; and Thompson Sampling, an approach rooted in Bayesian inference. The paper conducts simulation experiments in a Bernoulli bandit environment using three evaluation criteria: cumulative regret, convergence speed, and parameter dependence, and comprehensively analyzes the performance of the three algorithms. The results show that Thompson Sampling achieved the lowest cumulative regret and the fastest convergence speed, followed by UCB1. The performance of ε-Greedy is highly sensitive to its hyperparameters. These findings may provide practical guidance for algorithm selection in real-world scenarios with similar properties and validate the theoretical advantages of the probability matching strategy.
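An experiment of the kind described above can be simulated in a few lines. This is a minimal sketch, not the authors' code: the epsilon value, the Beta(1, 1) priors, and the reward means are illustrative choices.

```python
import math
import random

def run_policy(policy, means, horizon, seed=0):
    """Simulate one policy ("eps", "ucb1", or "ts") on a Bernoulli
    bandit and return its cumulative pseudo-regret relative to always
    playing the best arm."""
    rng = random.Random(seed)
    n = len(means)
    counts, sums = [0] * n, [0.0] * n
    alpha, beta = [1] * n, [1] * n        # Beta(1, 1) priors for Thompson
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if policy == "eps":               # epsilon-greedy, eps = 0.1
            if rng.random() < 0.1 or all(c == 0 for c in counts):
                arm = rng.randrange(n)
            else:
                arm = max(range(n), key=lambda a: sums[a] / max(counts[a], 1))
        elif policy == "ucb1":            # optimism in the face of uncertainty
            if 0 in counts:
                arm = counts.index(0)
            else:
                arm = max(range(n), key=lambda a: sums[a] / counts[a]
                          + math.sqrt(2 * math.log(t) / counts[a]))
        else:                             # Thompson Sampling: posterior draws
            arm = max(range(n), key=lambda a: rng.betavariate(alpha[a], beta[a]))
        r = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += r
        alpha[arm] += int(r)
        beta[arm] += 1 - int(r)
        regret += best - means[arm]
    return regret
```

Comparing the three returned regret curves over many seeds reproduces the qualitative ranking the paper reports: probability matching (Thompson Sampling) and UCB1 incur logarithmic regret, while a fixed epsilon keeps paying a constant exploration tax.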
- Research Article
- 10.1016/j.asoc.2025.113718
- Dec 1, 2025
- Applied Soft Computing
- Qiang Lu + 5 more
Embedding neural sampling and adversarial bandit into gene expression programming for symbolic regression
- Research Article
- 10.1145/3771570
- Dec 1, 2025
- Proceedings of the ACM on Measurement and Analysis of Computing Systems
- Mengfan Xu + 5 more
We study a novel heterogeneous multi-agent multi-armed bandit problem with a cluster structure induced by stochastic block models (SBMs), influencing not only graph topology but also reward heterogeneity. Specifically, agents are distributed on random graphs based on SBMs, a generalized Erdős–Rényi model with heterogeneous edge probabilities: agents are grouped into clusters (known or unknown), and edge probabilities for agents within the same cluster differ from those across clusters. The same cluster structure in the SBMs also determines our heterogeneous rewards. Reward distributions of the same arm vary across agents in different clusters but remain consistent within a cluster, unifying the homogeneous and heterogeneous settings and allowing varying degrees of heterogeneity. Rewards are independent samples from these distributions. The objective is to minimize system-wide regret across all agents. To address this, we propose a novel algorithm applicable to both known and unknown cluster settings. The algorithm combines an averaging-based consensus approach with a newly introduced information aggregation and weighting technique, resulting in a UCB-type strategy. It accounts for graph randomness, leverages both intra-cluster (homogeneous) and inter-cluster (heterogeneous) information from rewards and graphs, and incorporates cluster detection for unknown cluster settings. We derive optimal instance-dependent regret upper bounds of order log T under sub-Gaussian rewards. Importantly, our regret bounds capture the degree of heterogeneity in the system (an additional layer of complexity), exhibit smaller constants, scale better for large systems, and impose significantly relaxed assumptions on edge probabilities.
- Research Article
- 10.23919/icn.2025.0025
- Dec 1, 2025
- Intelligent and Converged Networks
- Yulei Wang + 3 more
Collaborative DNN Inference in Maritime Edge Intelligence Networks with Group Neural Multi-Armed Bandits