Controller exploitation-exploration reinforcement learning architecture for computing near-optimal policies

Erick Asiain, Alexander S Poznyak, Julio B Clempner

doi:10.1007/s00500-018-3225-7

Abstract

This paper proposes a new controller exploitation-exploration (CEE) reinforcement learning (RL) architecture that attains a near-optimal policy. The architecture consists of three modules: a controller, a fast-tracked learning module, and an actor-critic module. Strategies are represented by a probability distribution $$c_{ik}$$. The controller balances exploration and exploitation, using the Kullback–Leibler divergence to decide whether newly proposed strategies are better than the immediate strategy currently employed. Exploitation uses a fast-tracked learning algorithm, which employs a fixed strategy and a priori knowledge; the method only needs to find estimated values of the transition matrices and utilities. Exploration employs an actor-critic architecture: the actor computes the strategies using a policy gradient method, and the critic determines whether the proposed strategies are accepted. We show the convergence of the algorithms proposed for implementing the architecture. An application example related to inventory shows the effectiveness of the proposed architecture.
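The abstract names two building blocks without giving formulas: a Kullback–Leibler comparison between strategies, and estimation of transition matrices from experience. The sketch below illustrates both in generic form. It is not the paper's algorithm: the function names `kl_divergence` and `estimate_transitions`, the clipping constant `eps`, and the uniform fallback for unvisited state-action pairs are all assumptions made for illustration.

```python
import math
from collections import Counter


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set.

    eps is an assumed guard against division by zero, not taken from the paper.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def estimate_transitions(samples, n_states, n_actions):
    """Empirical estimate of P(s' | s, a) from observed (s, a, s') triples."""
    counts = Counter(samples)
    est = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            total = sum(counts[(s, a, t)] for t in range(n_states))
            for t in range(n_states):
                # Assumed fallback: a uniform row when (s, a) was never visited.
                est[s][a][t] = counts[(s, a, t)] / total if total else 1.0 / n_states
    return est


# Hypothetical strategies for one state with two actions.
current = [0.5, 0.5]
candidate = [0.9, 0.1]
divergence = kl_divergence(candidate, current)

# Hypothetical experience: three observed transitions from state 0 under action 0.
observed = [(0, 0, 1), (0, 0, 1), (0, 0, 0)]
transition_estimate = estimate_transitions(observed, n_states=2, n_actions=1)
```

How a controller would turn the divergence into an explore-or-exploit decision (for example, a threshold) is specified in the paper itself, not in the abstract, so it is left out here.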

Journal: Soft Computing
Publication Date: May 10, 2018
Citations: 18

