Abstract

An accelerated multi-armed bandit (MAB) model for online sequential selection problems in reinforcement learning is presented. This iterative model uses an automatic step-size calculation that improves the performance of the MAB algorithm under challenging conditions such as time-varying reward variance and larger sets of available actions. As a result of these modifications, the number of optimal selections is maximized and the stability of the algorithm under the mentioned conditions is improved. This adaptive model with automatic step-size computation may be attractive for online applications in which the variance of the observations varies with time and re-tuning of the step size is unavoidable, since such re-tuning is not a simple task. The proposed model is governed by the upper confidence bound (UCB) approach in iterative form with automatic step-size computation. It is called adaptive UCB (AUCB) and may be used in industrial robotics, autonomous control, and intelligent selection or prediction tasks in economic engineering applications under a lack of information.
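As background (this is the standard incremental value update in reinforcement learning, not the paper's specific derivation), the role of the step size the abstract refers to can be seen in the update

$$V_{k+1}(a) = V_k(a) + \alpha_k \big(r_k - V_k(a)\big),$$

where $\alpha_k = 1/k_a$ recovers the sample mean, while a constant $\alpha$ weights recent rewards more heavily and is better suited to non-stationary observations. AUCB's stated contribution is to compute $\alpha_k$ automatically rather than requiring manual re-tuning.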

Highlights

  • A growing number of models in autonomous and adaptive control applications operate on the basis of intelligent learning approaches to perform “sequential decision” tasks

  • This study aims to evaluate the upper confidence bound (UCB) approach of the multi-armed bandit (MAB) model under different conditions and to present an “iterative MAB algorithm” based on the UCB approach that reduces the mentioned limitations

  • Comparisons with different settings have been conducted to show the performance of adaptive UCB (AUCB) under variable observations, whereas similar models degrade under these conditions


Introduction

A growing number of models in autonomous and adaptive control applications operate on the basis of intelligent learning approaches to perform “sequential decision” tasks. These represent a truly fundamental advance from traditional control processes to intelligent approaches. Such approaches should be able to perform sequential decision making over long control horizons in which the exploration-exploitation trade-off is inherently considered. Subjects such as “iterative learning control and reinforcement learning” in adaptive control and robotics, autonomous agents, and intelligent decision making have been widely developed. The decision maker faces a set of options, without any extra knowledge to indicate the best one, and must decide which one to select so that the total reward is maximized. Maximizing this cumulative reward is equivalent to minimizing the regret, i.e., the difference between the cumulative reward that would have been obtained by always playing the best action and the sum of the rewards actually collected at each round. After an action $a$ has been chosen $k_a$ times, the instantaneous estimate of its “actual value” $V^*(a)$ at step $k$ is obtained through the sample-mean equation:

$$V_k(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a},$$

where $r_i$ is the reward received on the $i$-th selection of action $a$.
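To make the sample-mean update and the UCB selection rule concrete, the following is a minimal sketch in Python of the standard UCB1 algorithm, using the incremental form of the equation above (step size $1/k_a$). The exploration constant `c`, the reward-generating function, and all names here are illustrative assumptions, not the paper's implementation; AUCB replaces the fixed $1/k_a$ step size with an automatically computed one, whose details are given in the full text.

```python
import math
import random

def ucb1(n_arms, n_rounds, pull, c=2.0):
    """Standard UCB1 bandit loop with incremental sample-mean updates.

    pull(a) -> float returns a stochastic reward for arm a.
    c is an illustrative exploration constant, not taken from the paper.
    """
    counts = [0] * n_arms    # k_a: how often each arm has been chosen
    values = [0.0] * n_arms  # V_k(a): sample-mean reward estimates

    for k in range(1, n_rounds + 1):
        if k <= n_arms:
            a = k - 1  # play each arm once to initialize its estimate
        else:
            # choose the arm maximizing the upper confidence bound
            a = max(range(n_arms),
                    key=lambda i: values[i]
                    + math.sqrt(c * math.log(k) / counts[i]))
        r = pull(a)
        counts[a] += 1
        # incremental sample mean: V <- V + (1/k_a)(r - V).
        # AUCB would replace this fixed 1/k_a step size with an
        # automatically computed one (see the full text).
        values[a] += (r - values[a]) / counts[a]

    return counts, values

# Example usage with hypothetical Bernoulli arms:
probs = [0.2, 0.5, 0.8]
counts, values = ucb1(len(probs), 10_000,
                      pull=lambda a: 1.0 if random.random() < probs[a] else 0.0)
print(counts, values)
```

Over time the loop concentrates its pulls on the arm with the highest mean while the confidence term keeps occasional exploration alive; under time-varying reward variance, the fixed $1/k_a$ schedule reacts slowly, which is the limitation AUCB targets.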
