Abstract

An approach to accelerating the learning process of the actor-critic algorithm for reinforcement learning is presented. The algorithm is derived from principles based on the prediction of average rewards and on temporal difference (TD) learning with averaged and discounted rewards. It is applied to neural networks and shown to operate effectively on nonlinear control problems. The motivation is to show how a learning scheme implemented by artificial neural networks (ANNs) can speed up learning through an arrangement akin to Dyna-Q learning, in which a simulative model of the controlled plant supports virtual learning between two control cycles. Instead of modeling the complicated plant, the approach introduces only a simple predictor of rewards for virtual learning in simulation mode. Two TD learning methods, based on discounted and averaged rewards respectively, are used alternately in the control and simulation modes to realize the derived algorithm. The proposed Alternative Learning Critic (ALC) algorithm consists of two subsystems: an Evaluation Predictor (EP), which approximates a long-term evaluation function, and an immediate action selector composed of two ANNs, an Action Controller (AC) and a Reinforcement Predictor (RP). The learning scheme is then applied to controlling a pendulum system to track a desired trajectory, demonstrating its performance and robustness. Through reinforcement signals from the environment, the system applies appropriate actions to a plant with unknown dynamics so that the actual output tracks the desired trajectory closely within a few learning cycles. Further, the ALC is used as a compensator for a PI controller, which by itself works well only on linear systems, to control the same pendulum system.
The results show that the combined system, the trained ALC together with the PI controller, can successfully control a nonlinear system with unknown dynamics.
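To make the alternating-mode idea concrete, the following is a minimal tabular sketch of the ALC arrangement described above. It is an illustrative assumption on my part, not the paper's neural-network formulation: the EP is reduced to a value table, the AC to action preferences, and the RP to a learned immediate-reward table. In control mode a real reward drives a discounted-reward TD update; in simulation mode the RP's predicted rewards, compared against a running average reward, drive extra actor updates between control cycles in place of a full plant model (the Dyna-Q-like element).

```python
class ALCSketch:
    """Hypothetical tabular simplification of the Alternative Learning
    Critic (ALC): EP = value table V, AC = preference table p,
    RP = predicted-reward table r_hat. Update rules are illustrative."""

    def __init__(self, n_states, n_actions, alpha=0.2, gamma=0.95, eta=0.1):
        self.V = [0.0] * n_states                                # EP (critic)
        self.p = [[0.0] * n_actions for _ in range(n_states)]    # AC (actor)
        self.r_hat = [[0.0] * n_actions for _ in range(n_states)]  # RP
        self.rho = 0.0                         # running average reward
        self.alpha, self.gamma, self.eta = alpha, gamma, eta

    def act(self, s):
        # Greedy choice w.r.t. actor preferences (exploration omitted).
        prefs = self.p[s]
        return max(range(len(prefs)), key=prefs.__getitem__)

    def control_update(self, s, a, r, s_next):
        """Control mode: discounted-reward TD update from a real transition."""
        delta = r + self.gamma * self.V[s_next] - self.V[s]
        self.V[s] += self.alpha * delta                  # critic update (EP)
        self.p[s][a] += self.alpha * delta               # actor update (AC)
        self.r_hat[s][a] += self.eta * (r - self.r_hat[s][a])  # RP update
        self.rho += self.eta * (r - self.rho)            # average reward

    def virtual_update(self, s):
        """Simulation mode: between control cycles, adjust the actor using
        predicted rewards relative to the average reward, instead of
        simulating the plant itself (average-reward TD flavor)."""
        for a, r_pred in enumerate(self.r_hat[s]):
            self.p[s][a] += self.alpha * (r_pred - self.rho)
```

In this sketch the virtual step costs only a table lookup per action, which is the abstract's point: the cheap reward predictor stands in for a full simulative plant model during the extra learning passes between control cycles.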
