Abstract

The task of learning the value function under a fixed policy in continuous Markov decision processes (MDPs) is considered. Although the extreme learning machine (ELM) learns quickly and avoids the tuning issues of traditional artificial neural networks (ANNs), the randomness of its parameters leads to fluctuating performance. In this paper, a least-squares temporal difference algorithm with eligibility traces based on the regularized extreme learning machine (RELM-LSTD(λ)) is proposed to overcome these problems in reinforcement learning. The proposed algorithm combines the LSTD(λ) algorithm with an RELM, which is used to approximate the value function, and an eligibility trace term is introduced to increase data efficiency. In experiments, the performance of the proposed algorithm is demonstrated and compared with that of LSTD and ELM-LSTD. The results show that the proposed algorithm achieves more stable and better performance in approximating the value function under a fixed policy.
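To make the combination concrete, the following is a minimal sketch of LSTD(λ) policy evaluation using a fixed random hidden layer as an ELM-style basis and a ridge term standing in for the RELM regularization. All names, dimensions, and constants (e.g. `n_hidden`, `ridge`) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, n_hidden = 4, 50           # assumed sizes of the state and hidden layer
gamma, lam, ridge = 0.99, 0.8, 1e-3   # discount, trace decay, RELM-style ridge term

# ELM hidden layer: weights and biases are drawn once at random and never trained.
W = rng.normal(size=(n_hidden, state_dim))
c = rng.normal(size=n_hidden)

def features(s):
    """Sigmoid hidden-layer activations used as the value-function basis."""
    return 1.0 / (1.0 + np.exp(-(W @ s + c)))

def lstd_lambda(transitions):
    """Estimate output weights theta so that V(s) ~= features(s) @ theta.

    `transitions` is a list of (s, r, s_next, done) tuples collected under
    the fixed policy being evaluated.
    """
    A = np.zeros((n_hidden, n_hidden))
    b = np.zeros(n_hidden)
    z = np.zeros(n_hidden)                         # eligibility trace
    for s, r, s_next, done in transitions:
        phi = features(s)
        phi_next = np.zeros(n_hidden) if done else features(s_next)
        z = gamma * lam * z + phi                  # accumulate the trace
        A += np.outer(z, phi - gamma * phi_next)
        b += z * r
        if done:
            z = np.zeros(n_hidden)                 # restart the trace each episode
    # The ridge term plays the role of the RELM regularization and keeps A well conditioned.
    theta = np.linalg.solve(A + ridge * np.eye(n_hidden), b)
    return theta
```

In this sketch the only trained quantities are the output weights `theta`, solved in closed form, which is what gives the ELM/RELM approach its fast learning speed relative to gradient-trained ANNs; the regularization term is what damps the performance fluctuations caused by the random hidden layer.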
