Abstract

An accelerated multi-armed bandit (MAB) model for online sequential selection problems in reinforcement learning is presented. This iterative model uses an automatic step-size calculation that improves the performance of the MAB algorithm under challenging conditions such as time-varying reward variance and larger sets of available actions. As a result of these modifications, the number of optimal selections is maximized and the stability of the algorithm under the mentioned conditions is improved. This adaptive model with automatic step-size computation may be attractive for online applications in which the variance of the observations varies with time and re-tuning of the step size is unavoidable, since such re-tuning is not a simple task. The proposed model is governed by the upper confidence bound (UCB) approach in iterative form with automatic step-size computation. It is called adaptive UCB (AUCB) and may be used in industrial robotics, autonomous control, and intelligent selection or prediction tasks in economic engineering applications under a lack of information.
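As background (this is the standard incremental value update in reinforcement learning, not the paper's specific derivation), the role of the step size the abstract refers to can be seen in the update

$$V_{k+1}(a) = V_k(a) + \alpha_k \big(r_k - V_k(a)\big),$$

where $\alpha_k = 1/k_a$ recovers the sample mean, while a constant $\alpha$ weights recent rewards more heavily and is better suited to non-stationary observations. AUCB's stated contribution is to compute $\alpha_k$ automatically rather than requiring manual re-tuning.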

Highlights

  • A growing number of models in autonomous and adaptive control applications operate on the basis of intelligent learning approaches to perform “sequential decision” tasks

  • This study aims to evaluate the upper confidence bound (UCB) approach of the multi-armed bandit (MAB) model under different conditions and to present an “iterative MAB algorithm” based on the UCB approach that reduces the mentioned limitations

  • Comparisons with different settings have been conducted to show the performance of adaptive UCB (AUCB) under variable observations, whereas similar models degrade under these conditions


Introduction

A growing number of models in autonomous and adaptive control applications operate on the basis of intelligent learning approaches to perform “sequential decision” tasks. These represent a truly fundamental advance from traditional control processes to intelligent approaches. Such approaches should be able to perform sequential decision making over long control horizons in which the exploration-exploitation trade-off is inherently considered. Subjects such as “iterative learning control and reinforcement learning” in adaptive control and robotics, autonomous agents, and intelligent decision making have been widely developed. The decision maker faces a set of options, without any extra knowledge to indicate the best one, and must decide which one to select so that the total reward is maximized. Maximizing this cumulative reward is equivalent to minimizing the regret, i.e., the difference between the cumulative reward that would have been obtained by always playing the best action and the sum of the rewards actually collected at each round. After an action $a$ has been chosen $k_a$ times, the instantaneous estimate of its “actual value” $V^*(a)$ at step $k$ is obtained through the sample-mean equation:

$$V_k(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a},$$

where $r_i$ is the reward received on the $i$-th selection of action $a$.
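To make the sample-mean update and the UCB selection rule concrete, the following is a minimal sketch in Python of the standard UCB1 algorithm, using the incremental form of the equation above (step size $1/k_a$). The exploration constant `c`, the reward-generating function, and all names here are illustrative assumptions, not the paper's implementation; AUCB replaces the fixed $1/k_a$ step size with an automatically computed one, whose details are given in the full text.

```python
import math
import random

def ucb1(n_arms, n_rounds, pull, c=2.0):
    """Standard UCB1 bandit loop with incremental sample-mean updates.

    pull(a) -> float returns a stochastic reward for arm a.
    c is an illustrative exploration constant, not taken from the paper.
    """
    counts = [0] * n_arms    # k_a: how often each arm has been chosen
    values = [0.0] * n_arms  # V_k(a): sample-mean reward estimates

    for k in range(1, n_rounds + 1):
        if k <= n_arms:
            a = k - 1  # play each arm once to initialize its estimate
        else:
            # choose the arm maximizing the upper confidence bound
            a = max(range(n_arms),
                    key=lambda i: values[i]
                    + math.sqrt(c * math.log(k) / counts[i]))
        r = pull(a)
        counts[a] += 1
        # incremental sample mean: V <- V + (1/k_a)(r - V).
        # AUCB would replace this fixed 1/k_a step size with an
        # automatically computed one (see the full text).
        values[a] += (r - values[a]) / counts[a]

    return counts, values

# Example usage with hypothetical Bernoulli arms:
probs = [0.2, 0.5, 0.8]
counts, values = ucb1(len(probs), 10_000,
                      pull=lambda a: 1.0 if random.random() < probs[a] else 0.0)
print(counts, values)
```

Over time the loop concentrates its pulls on the arm with the highest mean while the confidence term keeps occasional exploration alive; under time-varying reward variance, the fixed $1/k_a$ schedule reacts slowly, which is the limitation AUCB targets.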
