
Related Topics

  • Bandit Algorithm

Articles published on Thompson sampling

413 Search results
  • New
  • Research Article
  • 10.54097/schj0v87
A Comparative Analysis of Cumulative Regret Based on Multi-Armed Bandit Algorithms
  • Jan 29, 2026
  • Academic Journal of Science and Technology
  • Muqing Xue

This study conducts a detailed comparison of the performance of three classic Multi-Armed Bandit algorithms: Thompson Sampling, UCB, and ETC. The MAB problem is an important sequential decision-making framework whose primary challenge lies in striking a balance between "exploration" and "exploitation". We quantitatively analyzed each algorithm's long-term performance using 100 independent experiments and cumulative regret as the primary metric. The experimental findings demonstrate that the three algorithms' performance varies significantly in the tested context. Thompson Sampling performed best, with the slowest growth in regret and the lowest final value. The UCB algorithm performed second-best, with regret growing logarithmically. The ETC algorithm accumulated regret rapidly in the early stages before stabilizing, but performed worst overall because it lacks the ability to explore continuously. These findings confirm that Thompson Sampling is the most efficient at balancing exploration and exploitation and is the best choice for such random stationary Multi-Armed Bandit problems.
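The comparison described above can be reproduced in miniature. The sketch below runs all three policies on a synthetic Bernoulli bandit; the arm means, horizon, and ETC exploration budget are illustrative choices, not the paper's settings:

```python
import numpy as np

def run(policy, means, T, rng):
    """One Bernoulli-bandit run; returns cumulative (pseudo-)regret."""
    K, best = len(means), max(means)
    succ, fail = np.zeros(K), np.zeros(K)
    regret = 0.0
    for t in range(T):
        pulls = succ + fail
        if np.any(pulls == 0):                 # play every arm once first
            arm = int(np.argmin(pulls))
        elif policy == "ts":                   # Thompson Sampling: sample Beta posteriors
            arm = int(np.argmax(rng.beta(succ + 1, fail + 1)))
        elif policy == "ucb":                  # UCB1: empirical mean + optimism bonus
            arm = int(np.argmax(succ / pulls + np.sqrt(2 * np.log(t + 1) / pulls)))
        else:                                  # ETC: explore m rounds per arm, then commit
            m = 50
            arm = t % K if t < m * K else int(np.argmax(succ / pulls))
        r = float(rng.random() < means[arm])
        succ[arm] += r
        fail[arm] += 1 - r
        regret += best - means[arm]
    return regret

rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]                        # illustrative arm means
avg_regret = {p: float(np.mean([run(p, means, 2000, rng) for _ in range(20)]))
              for p in ("ts", "ucb", "etc")}
print({p: round(v, 1) for p, v in avg_regret.items()})
```

On a setup like this, Thompson Sampling typically shows the slowest regret growth, matching the paper's qualitative finding.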

  • New
  • Research Article
  • 10.1016/j.cels.2025.101476
Risk-averse optimization of genetic circuits under uncertainty.
  • Jan 21, 2026
  • Cell systems
  • Michal Kobiela + 2 more

  • New
  • Research Article
  • 10.63593/jwe.2025.12.06
Cross-Border E-Commerce TikTok Live Streaming Data Three-Dimensional Optimization Model Construction and Empirical Study — Based on Singaporean Technology Product Markets and Scenario Migration to U.S. Warehousing Services
  • Jan 19, 2026
  • Journal of World Economy
  • Yiyang Wu

In zero-paid-traffic scenarios, TikTok technology live streams typically face a systemic dilemma characterized by scarce traffic entry points, inadequate audience retention, and depressed average order values. Extant research predominantly focuses on low-involvement product categories and paid growth strategies, leaving a theoretical gap in systematic investigation of organic growth mechanisms for high-involvement technology products. Grounded in attention economy theory and collaborative optimization theory, this study employs 42 live streams featuring 15 technology products from Tesen Global Technology in Singapore as a natural experiment. We construct a three-dimensional collaborative optimization model encompassing “time slots—scripts—product mix” and implement real-time attention allocation via online gradient descent algorithms, while dynamically iterating product combinations through Thompson Sampling. Validation using 73-day panel data demonstrates that post-intervention, organic follower growth increased by 208%, conversion rates rose by 125%, average order value climbed by 20.5%, and cumulative advertising expenditure savings reached $12,000; 5,000 randomization permutation tests confirm robust effects (p<0.01). Furthermore, applying service marketing theory, we migrate the model to the U.S. small-to-medium warehousing sector, proposing an “inventory turnover rate visualization live stream + service package matrix” approach, which is projected to reduce cost per lead (CPL) from $180 to $90. This research establishes a multi-dimensional collaborative optimization framework for live streaming, filling theoretical voids regarding high-involvement product growth in zero-ad-spend contexts and providing a replicable methodological paradigm for organic cross-border e-commerce expansion.

  • Research Article
  • 10.1002/sim.70386
Bayesian Response-Adaptive Randomization for Cluster Randomized Controlled Trials.
  • Jan 1, 2026
  • Statistics in medicine
  • Yunyi Liu + 2 more

Cluster randomized controlled trials where groups (or clusters) of individuals, rather than single individuals, are randomized are especially useful when individual-level randomization is not feasible or when interventions are naturally delivered at the group level. Balanced randomization in the cluster randomized trial setting can pose logistical challenges and strain resources if subjects are randomized to a non-optimal arm. We propose a Bayesian response-adaptive randomization design for cluster randomized controlled trials based on Thompson sampling, which dynamically allocates clusters to the most efficacious treatment arm based on the interim posterior distributions of treatment effects using Markov chain Monte Carlo sampling. Our design also incorporates early stopping rules for efficacy and futility determined by prespecified posterior probability thresholds. The performance of the proposed design is evaluated across various operating characteristics under multiple settings, including varying intra-cluster correlation coefficients, cluster sizes, and effect sizes. Our adaptive approach is also compared with a standard, parallel two-arm cluster randomized controlled clinical trial design, highlighting improvements in both ethical considerations and efficiency. From our simulation studies based on an HIV behavioral trial, we demonstrate these improvements by preferentially assigning more clusters to the more efficacious intervention while maintaining robust statistical power and controlling false positive rates.
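As a simplified stand-in for the design described above (the paper fits cluster-level models with Markov chain Monte Carlo; here each arm is an independent Beta-Binomial with invented interim counts), Thompson-style allocation probabilities can be estimated by posterior sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented interim data for two arms: successes and failures per arm,
# with a flat Beta(1, 1) prior on each response rate.
succ = np.array([12, 18])
fail = np.array([20, 14])

# Posterior probability that each arm is best, via Monte Carlo draws.
draws = rng.beta(succ + 1, fail + 1, size=(10_000, 2))
p_best = np.bincount(np.argmax(draws, axis=1), minlength=2) / len(draws)

# Thompson-style response-adaptive randomization: allocate the next
# cluster to each arm with probability proportional to p_best.
next_arm = int(rng.choice(2, p=p_best))
print(p_best, next_arm)
```

The same `p_best` quantities, compared against prespecified thresholds, would drive the early stopping rules for efficacy and futility.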

  • Research Article
  • 10.54254/2755-2721/2026.tj30959
Dynamic Pilot Optimization of South Korea's Urban-Rural Fertility Policies Based on Improved Sliding Window UCB Algorithm
  • Dec 31, 2025
  • Applied and Computational Engineering
  • Yang Gou

South Korea is facing a significant low-fertility rate issue, with varying success in fertility policy outcomes between urban and rural areas. The traditional fixed-area pilot model struggles to adapt to non-stationary fluctuations in fertility rates, leading to high trial-and-error costs. This study addresses the optimization of urban-rural fertility policies by proposing an enhanced Sliding Window Upper Confidence Bound (SW-UCB) algorithm that combines a sliding window with a forgetting factor. It treats Seoul and South Jeolla Province as arms of a Multi-Armed Bandit model, defining the increase in fertility rate per unit subsidy as the reward and conducting a simulated 60-month pilot. The improved algorithm demonstrates a 22.3% reduction in cumulative regret compared to the traditional UCB algorithm and a 19.4% reduction compared to Thompson Sampling, effectively accommodating fluctuations in fertility rates and aiding the precise adjustment of policies.
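A sliding-window UCB with a forgetting factor, as named above, can be sketched as follows; the window size, discount gamma, and bonus constant are illustrative guesses, not the paper's settings:

```python
import numpy as np
from collections import deque

class SlidingWindowUCB:
    """SW-UCB sketch: statistics come from the last `window` observations
    only and are down-weighted by a forgetting factor `gamma`."""

    def __init__(self, n_arms, window=100, gamma=0.98, c=2.0):
        self.n_arms, self.gamma, self.c = n_arms, gamma, c
        self.history = deque(maxlen=window)      # (arm, reward) pairs

    def select(self):
        w = [0.0] * self.n_arms                  # discounted pull counts
        s = [0.0] * self.n_arms                  # discounted reward sums
        age = len(self.history)
        for i, (arm, r) in enumerate(self.history):
            d = self.gamma ** (age - 1 - i)      # older observations decay
            w[arm] += d
            s[arm] += d * r
        for a in range(self.n_arms):
            if w[a] == 0.0:
                return a                         # play unseen arms first
        total = sum(w)
        ucb = [s[a] / w[a] + np.sqrt(self.c * np.log(total) / w[a])
               for a in range(self.n_arms)]
        return int(np.argmax(ucb))

    def update(self, arm, reward):
        self.history.append((arm, reward))

# Usage: Bernoulli arms whose means swap halfway through the run.
rng = np.random.default_rng(0)
alg = SlidingWindowUCB(2)
late_pulls = [0, 0]
for t in range(600):
    arm = alg.select()
    p = (0.8, 0.2) if t < 300 else (0.2, 0.8)
    alg.update(arm, float(rng.random() < p[arm]))
    if t >= 500:
        late_pulls[arm] += 1                     # pulls after adaptation
print(late_pulls)
```

Because old observations fall out of the window and decay in weight, the index tracks the post-switch best arm, which is the property the paper exploits for non-stationary fertility-rate fluctuations.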

  • Research Article
  • 10.54254/2755-2721/2025.ld30217
Multi-Armed Bandits and Clinical Medicine: A Survey of Algorithms, Evaluation, and Applications
  • Dec 3, 2025
  • Applied and Computational Engineering
  • Xiwen Guo

Personalized medicine and adaptive clinical trials aim to match optimal treatment plans to individual patients while validating new therapies. Traditional fixed-design trials have limitations, including resource wastage and scarce data. Artificial intelligence has led to the development of dynamic decision algorithms like the Multi-Armed Bandit (MAB) algorithm, which balances exploration and exploitation in treatment allocation. Researchers aim to integrate MAB with clinical needs while adhering to ethical guidelines. This paper discusses the use of Multi-Armed Bandit (MAB) algorithms in medicine, highlighting their potential for optimizing treatment allocation and improving patient outcomes, despite ethical constraints and limited data, and their application in contextual and reinforcement learning settings. This survey highlights key clinical applications such as adaptive dose-finding, personalized treatment selection, and digital health interventions, supported by both trial-based data and large-scale public datasets like MIMIC-III. Simulation studies are also discussed as a necessary complement to real-world data, facilitating algorithm validation under ethical and logistical constraints. Comparative evaluation of algorithms demonstrates that Bayesian methods, particularly Thompson Sampling and contextual bandits, often provide a more robust balance between efficiency and safety. However, challenges remain in scalability, interpretability, and regulatory acceptance. The paper concludes by identifying promising directions for future research, including the integration of deep reinforcement learning and causal inference, which may further enhance the role of MABs in advancing personalized medicine and adaptive clinical trial design.

  • Research Article
  • 10.54254/2755-2721/2025.ld30166
Multi-arm Bandit Machine Exploration - Investigating the Performance Differences of Classical Algorithms Through Trade-off Analysis
  • Dec 3, 2025
  • Applied and Computational Engineering
  • Haokai Tang

The exploration-exploitation dilemma in multi-arm bandit problems has long been a classic challenge and serves as a foundation of reinforcement learning. It has applications in various industries, such as online advertising, A/B testing, and clinical medicine. There are many MAB algorithms, each with its own advantages and disadvantages. This paper analyzes the performance of three classic MAB algorithms: the simple and effective ε-Greedy; the Upper Confidence Bound algorithm (UCB1), which is more optimistic when facing uncertainty; and Thompson Sampling, an approach rooted in Bayesian inference. This paper conducts simulation experiments in a Bernoulli bandit environment using three evaluation criteria: cumulative regret, convergence speed, and parameter dependence, and comprehensively analyzes the performance of the three algorithms. The results show that Thompson Sampling achieved the lowest cumulative regret and the fastest convergence, followed by UCB1. The performance of ε-Greedy is highly sensitive to its hyperparameters. These findings may provide practical guidance for algorithm selection in real-world scenarios with similar properties and validate the theoretical advantages of the probability matching strategy.
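The hyperparameter sensitivity of ε-Greedy noted above is easy to demonstrate. The following sketch (arm means, horizon, and ε grid are illustrative, not the paper's) compares cumulative regret across ε values:

```python
import numpy as np

def eps_greedy_regret(eps, means, T, rng):
    """Cumulative regret of epsilon-greedy on a Bernoulli bandit."""
    K, best = len(means), max(means)
    succ, pulls = np.zeros(K), np.zeros(K)
    regret = 0.0
    for _ in range(T):
        if np.any(pulls == 0):
            arm = int(np.argmin(pulls))          # play every arm once first
        elif rng.random() < eps:
            arm = int(rng.integers(K))           # explore uniformly at random
        else:
            arm = int(np.argmax(succ / pulls))   # exploit empirical best
        r = float(rng.random() < means[arm])
        succ[arm] += r
        pulls[arm] += 1
        regret += best - means[arm]
    return regret

rng = np.random.default_rng(0)
means = [0.2, 0.5, 0.8]                          # illustrative arm means
results = {eps: float(np.mean([eps_greedy_regret(eps, means, 2000, rng)
                               for _ in range(20)]))
           for eps in (0.01, 0.1, 0.5)}
print({eps: round(v, 1) for eps, v in results.items()})
```

A fixed ε forces linear regret (constant-rate exploration), so the curve never flattens the way Thompson Sampling's or UCB1's does, and the level of that linear growth depends directly on the ε chosen.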

  • Research Article
  • 10.69987/jacs.2025.51201
Counterfactual Learning-to-Rank for Ads: Off-Policy Evaluation on the Open Bandit Dataset
  • Dec 3, 2025
  • Journal of Advanced Computing Systems
  • Hanqi Zhang

Reliable offline evaluation is a central bottleneck in ad recommendation and ranking systems: online A/B experiments are expensive, slow, and risky, while naive offline replay is biased when logs are collected by a non-random policy. Counterfactual learning-to-rank (LTR) and off-policy evaluation (OPE) address this bottleneck by leveraging logged bandit feedback with known propensities. This paper presents a reproducible experimental study of IPS/SNIPS/DR estimators and counterfactual policy construction in a multi-position setting using the Open Bandit Dataset (OBD) released by ZOZO. We evaluate estimator behavior in cross-policy settings (Random ↔ Bernoulli Thompson Sampling), characterize heavy-tailed importance weights, and study robustness under propensity clipping. We further construct stochastic ranking policies from a fitted reward model, including a diversity-aware slate policy, and quantify the CTR–diversity trade-off via a Pareto analysis. Finally, we conduct a semi-synthetic evaluation that preserves real OBD covariates but simulates rewards from a learned environment, enabling bias–variance curves under known ground truth. Across experiments, self-normalization and doubly robust corrections improve stability, while the dominant failure mode remains limited overlap that produces heavy-tailed weights; clipping mitigates variance at the cost of controlled bias.
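The IPS and SNIPS estimators and the effect of propensity clipping can be sketched on synthetic logged feedback; the schema and numbers below are invented, not the Open Bandit Dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logged bandit feedback: a uniform-random logging policy
# over K actions with known propensities.
n, K = 50_000, 5
p_log = np.full(K, 1.0 / K)
actions = rng.integers(K, size=n)
true_ctr = np.linspace(0.02, 0.10, K)            # per-action reward rates
rewards = (rng.random(n) < true_ctr[actions]).astype(float)

# Target policy to evaluate: deterministic on the last action.
p_tgt = np.zeros(K)
p_tgt[K - 1] = 1.0                               # true value = 0.10

w = p_tgt[actions] / p_log[actions]              # importance weights
ips = float(np.mean(w * rewards))                # inverse propensity scoring
snips = float(np.sum(w * rewards) / np.sum(w))   # self-normalized IPS

# Propensity clipping caps the weights: lower variance, controlled bias.
ips_clip = float(np.mean(np.minimum(w, 3.0) * rewards))
print(round(ips, 4), round(snips, 4), round(ips_clip, 4))
```

Here every weight is either 0 or 5, so a cap of 3 visibly biases the estimate downward; on real logs with small propensities the same cap instead tames the heavy-tailed weights the paper identifies as the dominant failure mode.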

  • Research Article
  • 10.54254/2753-8818/2026.ch30042
An Empirical Comparison of Bayesian LinUCB, UCB, and Thompson Sampling for Recommendation on MovieLens
  • Nov 26, 2025
  • Theoretical and Natural Science
  • Jingyun Wang

Recommender systems have evolved into core business hubs, with approximately 35% of Amazon's revenue stemming from recommendation-guided behaviors. This study conducts a systematic comparative analysis of three multi-armed bandit algorithms, Bayesian Linear Upper Confidence Bound (Bayesian LinUCB), Upper Confidence Bound (UCB), and Thompson Sampling, using the MovieLens dataset. The research evaluates algorithm performance across three key dimensions: cumulative regret, optimal arm selection frequency, and regret rate. Experimental variables are strictly controlled with consistent parameters, including decision steps and data division ratios to eliminate confounding factors. Results reveal significant performance differences among the algorithms within the limited experimental steps on the MovieLens dataset. UCB demonstrates optimal performance with the lowest cumulative regret (817.93) and highest optimal arm selection frequency (0.9822), followed by Thompson Sampling with moderate performance (cumulative regret: 2776.36, selection frequency: 0.924). Bayesian LinUCB performs poorly across all metrics, showing the highest cumulative regret (34105.02), lowest selection frequency (0.1324), and a regret rate of approximately 1, indicating linear rather than sublinear growth. The sublinear growth characteristic exhibited by UCB and Thompson Sampling confirms their superior exploration-exploitation balance, while Bayesian LinUCB's linear growth pattern suggests inadequate adaptation to the MovieLens dataset scenario, highlighting the importance of algorithm-dataset compatibility in recommendation systems.

  • Research Article
  • 10.54254/2753-8818/2026.ch29998
Contextual Multi-Armed Bandits for Dynamic News Recommendation: An Empirical Evaluation
  • Nov 26, 2025
  • Theoretical and Natural Science
  • Jiashuo Wang

With the advent of the information explosion era, personalized news recommendation faces critical challenges including cold start problems, real-time changes in user preferences, and information filter bubbles. Traditional collaborative filtering methods rely heavily on historical data and struggle to adapt to the rapid update characteristics of news content. This paper proposes a news recommendation solution based on Multi-Armed Bandit (MAB) algorithms, addressing these challenges by balancing exploration and exploitation. The study implements four core algorithms: the ε-greedy algorithm balances exploration and exploitation through probability mechanisms; the Upper Confidence Bound (UCB) algorithm employs optimistic estimation using confidence upper bounds; Thompson sampling adopts probability adaptation based on a Bayesian framework; and Contextual Linear Bandit (LinUCB) integrates user and news features for personalized recommendations. Experiments on the MIND large-scale news dataset (containing 160,000 news articles, 1 million users, and 15 million click interactions) demonstrate that contextual bandit algorithms outperform traditional methods in click-through rate, dwell time, and recommendation diversity. Thompson sampling shows outstanding performance in click-through rates, while LinUCB excels in convergence speed and recommendation diversity. The experiments confirm that MAB algorithms can effectively adapt to dynamic changes in user preferences, providing a viable solution for real-time news recommendation systems.
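A minimal disjoint LinUCB, the contextual algorithm named above, might look like the following; the two-dimensional contexts, per-arm linear reward model, and alpha value are illustrative, not the paper's configuration:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB sketch: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=0.5):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))                    # optimistic pick

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage: two arms with different (hidden) linear reward functions.
rng = np.random.default_rng(0)
theta_true = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
alg = LinUCB(n_arms=2, dim=2)
correct = 0
for _ in range(500):
    x = rng.random(2)                                    # user/article context
    arm = alg.select(x)
    alg.update(arm, x, float(x @ theta_true[arm] + 0.1 * rng.normal()))
    correct += arm == int(np.argmax([x @ th for th in theta_true]))
print(correct)
```

Sharing a linear model across contexts is what lets LinUCB converge quickly on cold-start items, the advantage the experiments above attribute to it.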

  • Research Article
  • 10.54254/2753-8818/2026.ch30041
Comparative Analysis of ETC, UCB, and Thompson Sampling for Personalized Video Recommendations on Short-Video Platform
  • Nov 26, 2025
  • Theoretical and Natural Science
  • Shuqiao Chen

This study empirically compares three canonical Multi-Armed Bandit (MAB) algorithms for short-video recommendation: Explore-Then-Commit (ETC), which uses a fixed initial exploration phase; Upper Confidence Bound (UCB1), which relies on optimism-driven uncertainty estimation; and Thompson Sampling with Bernoulli likelihood (TS-Bernoulli), which is posterior-sampling-based. The aim is to solve the exploration-exploitation tradeoff in real-time feed systems. Experiments were conducted on the ShortVideo-Interactions (SVI-200K) dataset, a simulated corpus with ~1.2 million timestamped impressions and clicks from 240,000 user sessions over 30 days, covering ~18,000 unique items to mimic real platform dynamics. Evaluations used a fixed horizon (T=2000 timesteps) and restricted candidates to the top 200 items (K=200) per run, spanning three practical scenarios: a stable base, an information-scarce cold-start (new items with no prior data), and a preference-drifting temporal-shift. Results, aggregated over three pseudo-random seeds (2025, 2026, 2027), show TS-Bernoulli consistently outperforms its peers: it achieves the highest Click-Through Rate (CTR) (0.452 in base, 0.402 in cold-start, 0.428 in temporal-shift) and the lowest cumulative regret (418, 518, and 467, respectively). These findings confirm that TS-Bernoulli's posterior sampling enables robust adaptation to short-video recommendation's key challenges (information scarcity and non-stationarity), providing a practical algorithm choice for real-world platforms.

  • Research Article
  • 10.1149/ma2025-0283605mtgabs
Enhancing Fast-Charging Protocols with Section-Based Bayesian Optimization for Lithium-Ion Batteries to Prevent Li-Plating
  • Nov 24, 2025
  • ECS Meeting Abstracts
  • Yoon-Mo Lee + 3 more

Global adoption of electric vehicles is accelerating to achieve carbon-neutrality goals, yet charging time remains a major barrier to user acceptance. Reports show that the Tesla Model 3 (82 kWh) and Hyundai Ioniq 5 (77.4 kWh) require about 25 and 18 minutes, respectively, to reach 80% state of charge (SOC), which is still longer than internal combustion refueling [1]. To address this, the U.S. Department of Energy (DOE) and the U.S. Advanced Battery Consortium (USABC) have targeted extreme fast charging to 80% SOC within 15 minutes [2]. However, high charging currents can cause premature cut-off, incomplete electrode utilization, and accelerated degradation, while non-uniform current distribution may trigger Li-plating near the separator [3,4]. Recently, model-based charging protocols with Bayesian optimization (BO) have gained traction [5–7], but they rarely incorporate direct constraints to suppress Li-plating. Here, we propose a framework that integrates a physics-based electrochemical model with BO to optimize fast-charging protocols for lithium-ion batteries [8,9]. The model, validated against experimental data with an average error of 25 mV in voltage and 0.26 °C in temperature across multiple C-rates, enables direct control of Li-plating potential as a safety constraint. Using a commercial 55.6 Ah pouch-type cell, two multi-step constant-current strategies were compared: a single-section protocol and a bi-section protocol that partitions the SOC window based on internal resistance. The bi-section approach reduced charging time by up to 11% relative to the single-section method, while maintaining plating-free operation and suppressing SEI growth. Under high-temperature conditions with preheated cells at 60 °C, the optimized protocol achieved 0–80% SOC in 629 s (10.5 min), thereby meeting the USABC 15-min target. 
Compared with the conventional CCCV method, the proposed BO-based protocols shortened charging time by up to 20% while reducing capacity degradation. Cycling tests directly compared the optimized BO-based protocols with the conventional CCCV method and revealed significantly improved capacity retention and reduced degradation over repeated operation. Post-mortem analyses including SEM, XPS, and EDS further confirmed that cells charged with the optimized protocols exhibited markedly less lithium deposition, thinner SEI layers, and more intact graphite morphology than those charged under CCCV. These results provide strong experimental validation that the proposed section-based BO framework not only reduces charging time but also extends cell lifetime by mitigating key degradation pathways. Overall, the study demonstrates the practical applicability of the optimized protocols for safe, efficient, and plating-free fast charging of large-format EV batteries under diverse thermal conditions.

References
1. Mateen S, Amir M, Haque A, Bakhsh FI. Ultra-fast charging of electric vehicles: a review of power electronics converter, grid stability and optimal battery consideration in multi-energy systems. Sustain Energy Grids 2023;35.
2. Neubauer J, Pesaran A, Bae C, Elder R, Cunningham B. Updating United States Advanced Battery Consortium and Department of Energy battery technology targets for battery electric vehicles. J Power Sources 2014;271:614–621.
3. Yang XG, Wang CY. Understanding the trilemma of fast charging, energy density and cycle life of lithium-ion batteries. J Power Sources 2018;402:489–
4. Lin XK, Khosravinia K, Hu XS, Li J, Lu W. Lithium plating mechanism, detection, and mitigation in lithium-ion batteries. Prog Energ Combust 2021;87.
5. Jiang BB, Berliner MD, Lai K, Asinger PA, Zhao HB, Herring PK, Bazant MZ, Braatz RD. Fast charging design for lithium-ion batteries via Bayesian optimization. Appl Energ 2022;307.
6. Attia PM, Grover A, Jin N, Severson KA, Markov TM, Liao YH, Chen MH, Cheong B, Perkins N, Yang Z, Herring PK, Aykol M, Harris SJ, Braatz RD, Ermon S, Chueh WC. Closed-loop optimization of fast-charging protocols for batteries with machine learning. Nature 2020;578:397+.
7. Song XB, Jiang BB. Parallel Bayesian optimization using satisficing Thompson sampling for fast charging design of lithium-ion batteries. Eng Appl Artif Intell 2025;15.
8. Doyle M, Newman J, Gozdz AS, Schmutz CN, Tarascon JM. Comparison of modeling predictions with experimental data from plastic lithium-ion cells. J Electrochem Soc 1996;143:1890–
9. Arora P, Doyle M, Gozdz AS, White RE, Newman J. Comparison between computer simulations and experimental data for high-rate discharges of plastic lithium-ion batteries. J Power Sources 2000;88:219–3

  • Research Article
  • 10.3389/fpls.2025.1699124
Distributed multi-robot active gathering for non-uniform agriculture and forestry information
  • Oct 22, 2025
  • Frontiers in Plant Science
  • Jun Chen + 5 more

Active information gathering is a fundamental task in multi-robot systems in agriculture, with applications in precision planting and sowing, field management and inspection, intelligent weeding and pest control, etc. Traditional distributed strategies often struggle to adapt to environments where the information of interest is unevenly clustered, leading to slow detection and inefficient coverage. In this paper, we reformulate the information gathering problem as a multi-armed bandit (MAB) problem and propose a novel distributed Bernoulli Thompson Sampling algorithm. Our approach enables robots to make exploration-exploitation decisions while sharing probabilistic information across the team, thus improving global coordination without centralized control. We further combine the distributed Bernoulli Thompson Sampling policy with Lloyd’s algorithm for dynamic target tracking and introduce a goal swapping strategy to improve task allocation efficiency. Extensive simulations demonstrate that our method significantly outperforms baseline approaches in terms of search speed and target coverage, particularly in scenarios with clustered target distributions.

  • Research Article
  • 10.1145/3771931
A Reward-Informed Semi-Personalized Bandit Approach for Enhancing Accuracy and Serendipity in Online Slate Recommendations
  • Oct 21, 2025
  • ACM Transactions on Recommender Systems
  • Lukas De Kerpel + 1 more

Contextual bandits provide a principled framework for personalization in online recommendation settings. However, as these methods tailor recommendation slates to an individual user, they tend to induce overspecialization, yielding homogeneous recommendation lists that limit exposure to diverse content and contribute to more systemic issues such as filter bubbles and echo chambers. To mitigate these effects, recommender systems must complement predictive accuracy with serendipity, providing recommendations that are novel and unexpected while remaining contextually relevant. This study proposes a semi-personalized bandit that, for each item, learns a decision tree to segment users by contextual features and reward patterns, and runs a unique Thompson Sampling policy for each user segment to create recommendation slates. By pooling information across behaviorally similar users and conducting the exploration mechanism at the user segment level, the framework mitigates overspecialization issues and promotes serendipitous recommendations. Moreover, the approach is inherently interpretable, with decision trees revealing decision pathways that define user segments, offering insights into recommendation logic. Experiments across three different online domains show that the semi-personalized framework reduces average regret relative to personalized baselines while improving serendipity in sparse interaction settings. These findings underscore the potential of semi-personalized bandits to improve recommendation quality in complex environments.

  • Research Article
  • 10.1186/s13321-025-01105-1
Enhanced Thompson sampling by roulette wheel selection for screening ultralarge combinatorial libraries
  • Oct 13, 2025
  • Journal of Cheminformatics
  • Hongtao Zhao + 5 more

Chemical space exploration has gained significant interest with the increasing availability of building blocks, enabling the creation of ultralarge virtual libraries containing billions or trillions of compounds. However, challenges remain in selecting the most suitable compounds for synthesis, especially in hit expansion. Thompson sampling, a probabilistic search method, has recently been proposed to improve efficiency by operating in reagent space rather than product space. Here, we address some of its limitations by introducing a roulette wheel selection method combined with a thermal cycling approach to balance greedy search and diversity-driven exploration. The effectiveness of this method is demonstrated through 109 queries against twenty distinct 1-million-compound libraries using ROCS. Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-025-01105-1.
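Roulette wheel (fitness-proportionate) selection with a temperature knob can be sketched as below; the Boltzmann weighting and the score values are illustrative stand-ins for the paper's scoring of reagents and its thermal cycling between greedy and diversity-driven phases:

```python
import numpy as np

def roulette_wheel(scores, temperature, rng):
    """Fitness-proportionate (roulette wheel) pick over reagent scores.
    Low temperature approaches greedy search; high temperature
    approaches uniform, diversity-driven exploration."""
    scores = np.asarray(scores, dtype=float)
    w = np.exp((scores - scores.max()) / temperature)  # numerically stable
    return int(rng.choice(len(scores), p=w / w.sum()))

rng = np.random.default_rng(0)
scores = [0.9, 0.5, 0.1]                               # invented reagent scores
cold = [roulette_wheel(scores, 0.05, rng) for _ in range(1000)]
hot = [roulette_wheel(scores, 5.0, rng) for _ in range(1000)]
print(cold.count(0) / 1000, hot.count(0) / 1000)
```

Cycling the temperature between the two regimes alternates exploitation of high-scoring reagents with broader sampling, which is the balance the paper's thermal cycling aims for.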

  • Research Article
  • 10.3390/sym17101614
Statistical Learning-Assisted Evolutionary Algorithm for Digital Twin-Driven Job Shop Scheduling with Discrete Operation Sequence Flexibility
  • Sep 29, 2025
  • Symmetry
  • Yan Jia + 3 more

With the rapid development of Industry 5.0, smart manufacturing has become a key focus in production systems. Hence, achieving efficient planning and scheduling on the shop floor is important, especially in job shop environments, which are widely encountered in manufacturing. However, traditional job shop scheduling problems (JSP) assume fixed operation sequences, whereas in modern production, some operations exhibit sequence flexibility, referred to as sequence-free operations. To address this gap, this paper studies the JSP with discrete operation sequence flexibility (JSPDS), aiming to minimize the makespan. To effectively solve the JSPDS, a mixed-integer linear programming model is formulated to solve small-scale instances, verifying multiple optimal solutions. To enhance solution quality for larger instances, a digital twin (DT)–enhanced initialization method is proposed, which captures expert knowledge from a high-fidelity virtual workshop to generate a high-quality initial population. In addition, a statistical learning-assisted local search method is developed, employing six tailored search operators and Thompson sampling to adaptively select promising operators during the evolutionary algorithm (EA) process. Extensive experiments demonstrate that the proposed DT-statistical learning EA (DT-SLEA) significantly improves scheduling performance compared with state-of-the-art algorithms, highlighting the effectiveness of integrating digital twin and statistical learning techniques for shop scheduling problems. Specifically, in the Wilcoxon test, pairwise comparisons with the other algorithms show that DT-SLEA has p-values below 0.05. Meanwhile, the proposed framework provides guidance on utilizing symmetry to improve optimization in complex manufacturing systems.

  • Research Article
  • 10.1080/23335777.2025.2561607
Enhancing off-policy optimisation in structured Markov decision processes via Thompson Sampling
  • Sep 27, 2025
  • Cyber-Physical Systems
  • Sourav Ganguly

Reinforcement Learning (RL) provides a framework for solving sequential decision-making tasks, yet limited and potentially unsafe data hinder its training process. Off-policy algorithms are commonly employed to mitigate these issues, but they face challenges in data-scarce or non-ergodic environments and often exhibit exploding variance over long trajectories. We propose a novel algorithm that integrates Thompson Sampling, originally developed for multi-armed bandit problems, to enable efficient and safe policy identification. By exploiting the structural properties of Structured Markov Decision Processes (SMDPs), our approach reduces the policy search space, enhances learning stability, and demonstrates superior performance compared to Q-learning, SOCU, and SOCU-v2.

  • Research Article
  • 10.3390/ai6090209
QiMARL: Quantum-Inspired Multi-Agent Reinforcement Learning Strategy for Efficient Resource Energy Distribution in Nodal Power Stations
  • Sep 1, 2025
  • AI
  • Sapthak Mohajon Turjya + 3 more

The coupling of quantum computing with multi-agent reinforcement learning (MARL) provides an exciting direction to tackle intricate decision-making tasks in high-dimensional spaces. This work introduces a new quantum-inspired multi-agent reinforcement learning (QiMARL) model, utilizing quantum parallelism to achieve learning efficiency and scalability improvement. The QiMARL model is tested on an energy distribution task, which optimizes power distribution between generating and demanding nodal power stations. We compare the convergence time, reward performance, and scalability of QiMARL with traditional Multi-Armed Bandit (MAB) and Multi-Agent Reinforcement Learning methods, such as Greedy, Upper Confidence Bound (UCB), Thompson Sampling, MADDPG, QMIX, and PPO methods with a comprehensive ablation study. Our findings show that QiMARL yields better performance in high-dimensional systems, decreasing the number of training epochs needed for convergence while enhancing overall reward maximization. We also compare the algorithm’s computational complexity, indicating that QiMARL is more scalable to high-dimensional quantum environments. This research opens the door to future studies of quantum-enhanced reinforcement learning (RL) with potential applications to energy optimization, traffic management, and other multi-agent coordination problems.

  • Research Article
  • Cited by: 1
  • 10.1016/j.sciaf.2025.e02798
DTS-TSCH: Adaptive Thompson sampling for channel estimation in TSCH networks for IIoT
  • Sep 1, 2025
  • Scientific African
  • Adugna Necho Mulatu + 3 more


  • Research Article
  • 10.55670/fpll.futech.4.3.24
Dynamic reward systems and customer loyalty: reinforcement learning-optimized personalized service strategies
  • Aug 15, 2025
  • Future Technology
  • Xiaojing Nie + 1 more

Traditional customer loyalty programs employing static reward structures demonstrate fundamental limitations in adapting to evolving customer preferences and behaviors within digital commerce environments. This research addresses the critical gap in personalization capabilities by developing a reinforcement learning (RL)-based dynamic reward system that optimizes customer engagement through real-time adaptive reward allocation mechanisms. The investigation centers on designing and validating an intelligent system capable of automatically adjusting reward types, values, and timing parameters based on continuous analysis of individual customer interactions and feedback patterns. The proposed methodology implements a multi-armed bandit framework utilizing Thompson Sampling algorithms integrated with contextual learning mechanisms, thereby achieving an optimal balance between exploration and exploitation in reward optimization processes. Comprehensive experimental simulations compare the RL-based approach against traditional rule-based systems and random allocation strategies across five distinct customer segments, enabling robust performance evaluation under diverse operational conditions. Empirical results demonstrate that the RL-based system achieves 145% of baseline customer lifetime value (CLV), representing a 45% improvement over traditional methods, accompanied by corresponding enhancements in retention rate (32%) and engagement frequency (28%). The system maintains robust performance under budget constraints, sustaining 118% of baseline CLV despite a 30% budget reduction, with statistical analysis confirming significant improvements across all metrics (p < 0.001, Cohen's d > 1.7). These findings provide organizations with a scalable framework for implementing adaptive loyalty programs that respond dynamically to customer preferences while optimizing resource allocation efficiency. The research contributes to the expanding literature on AI-driven customer relationship management by demonstrating the practical effectiveness of reinforcement learning in personalization contexts.
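A per-segment Thompson Sampling loop of the kind the abstract above describes can be sketched as follows. All names, segments, and engagement probabilities here are hypothetical stand-ins for illustration; the paper's actual contextual mechanism and reward parameters are not reproduced.

```python
import random

def allocate_rewards(segments, reward_types, engagement_prob, rounds, seed=0):
    """Per-segment Thompson Sampling for reward allocation.

    Each (segment, reward type) pair holds a Beta posterior over the
    probability that offering that reward engages a customer from that
    segment. For each arriving customer, sample every posterior for the
    customer's segment and offer the best-sampled reward type.

    `engagement_prob[(segment, reward)]` is simulated ground truth used
    only to generate feedback in this sketch.
    """
    rng = random.Random(seed)
    # posts[(segment, reward)] = [alpha, beta] counts of a Beta posterior.
    posts = {(s, r): [1, 1] for s in segments for r in reward_types}
    engaged = 0
    for _ in range(rounds):
        seg = rng.choice(segments)
        # Sample an engagement estimate per reward type for this segment.
        reward = max(reward_types,
                     key=lambda r: rng.betavariate(*posts[(seg, r)]))
        hit = rng.random() < engagement_prob[(seg, reward)]
        posts[(seg, reward)][0 if hit else 1] += 1
        engaged += hit
    return engaged, posts
```

Keeping one posterior per segment is the simplest way to make allocation "contextual": each segment's preferences are learned independently, and the system shifts budget toward the reward types that segment actually responds to.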
