Abstract

Reinforcement learning (RL) is a general class of algorithms for solving decision-making problems, which are usually modeled with the Markov decision process (MDP) framework. RL can find exact solutions only when the MDP state space is discrete and small enough. Because many real-world problems are described by continuous variables, approximation is essential in practical applications of RL. This paper focuses on learning the value function of a fixed policy in continuous MDPs, an important subproblem of several RL algorithms. We propose a least-squares temporal difference (LSTD) algorithm based on the extreme learning machine. LSTD is typically combined with local function approximators, which scale poorly with the problem dimensionality. Our approach allows value functions to be approximated with single-hidden-layer feedforward networks (SLFNs), a type of artificial neural network extensively used in many fields. Owing to the global nature of SLFNs, the proposed approach is better suited than traditional methods to high-dimensional problems. The method was empirically evaluated on a set of MDPs whose dimensionality ranges from 1 to 6. For comparison, the experiments were replicated with a standard LSTD algorithm combined with Gaussian radial basis functions. The results suggest that, although both methods can accurately approximate value functions, the proposed approach requires considerably fewer resources for the same degree of accuracy.
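To make the idea concrete, the sketch below illustrates LSTD policy evaluation with ELM-style random features: a single hidden layer whose input weights are drawn at random and never trained, with only the linear output weights fitted by the least-squares system. The feature count, activation, regularization constant, and function names (`elm_features`, `lstd_elm`) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def elm_features(states, W, b):
    """Hidden-layer activations of a random SLFN (ELM-style basis)."""
    return np.tanh(np.atleast_2d(states) @ W + b)

def lstd_elm(transitions, W, b, gamma=0.99, reg=1e-6):
    """Estimate value-function weights of a fixed policy via LSTD.

    transitions: iterable of (state, reward, next_state) tuples
    collected while following the policy being evaluated.
    """
    n_hidden = W.shape[1]
    A = np.zeros((n_hidden, n_hidden))
    c = np.zeros(n_hidden)
    for s, r, s_next in transitions:
        phi = elm_features(s, W, b).ravel()
        phi_next = elm_features(s_next, W, b).ravel()
        A += np.outer(phi, phi - gamma * phi_next)  # LSTD system matrix
        c += phi * r                                # LSTD right-hand side
    # Small ridge term keeps A invertible when samples are scarce.
    return np.linalg.solve(A + reg * np.eye(n_hidden), c)

# Usage sketch: random hidden weights are drawn once, which is the
# defining trait of the extreme learning machine.
state_dim, n_hidden = 4, 50
rng = np.random.default_rng(0)
W = rng.normal(size=(state_dim, n_hidden))
b = rng.normal(size=n_hidden)
# transitions = [(s, r, s_next), ...] gathered from the environment
# w = lstd_elm(transitions, W, b)
# V_hat = elm_features(s, W, b) @ w   # approximate value of state s
```

Because the random features are global (each hidden unit responds over the whole state space), the number of features need not grow exponentially with the state dimension, unlike a grid of Gaussian radial basis functions.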
