Abstract

By combining with sparse kernel methods, least-squares temporal difference (LSTD) algorithms can construct the feature dictionary automatically and achieve better generalization. However, previous kernel-based LSTD algorithms do not consider regularization, and their sparsification processes are batch or offline, which hinders their widespread application to online learning problems. In this paper, we combine the following five techniques and propose two novel kernel recursive LSTD algorithms: (i) online sparsification, which can cope with unknown state regions and be used for online learning; (ii) L2 and L1 regularization, which can avoid overfitting and eliminate the influence of noise; (iii) recursive least squares, which can eliminate matrix-inversion operations and reduce computational complexity; (iv) a sliding-window approach, which can avoid caching all history samples and reduce the computational cost; and (v) fixed-point subiteration and online pruning, which make L1 regularization easy to implement. Finally, simulation results on two 50-state chain problems demonstrate the effectiveness of our algorithms.
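As a rough illustration of technique (iii), here is a minimal NumPy sketch of a recursive least-squares TD update for a fixed linear feature map. The class name, the Sherman-Morrison style rank-1 update, and the ridge-style initialization P = I/β are assumptions made for this sketch; the paper's algorithms instead work with a kernel dictionary built online, so this only shows why no explicit matrix inversion is needed per step.

```python
import numpy as np

class RLSTDSketch:
    """Recursive least-squares TD (RLS-TD) with a fixed linear feature map.

    Illustrative only: the discount `gamma` and the L2 weight `beta`
    (used to initialize P = (beta*I)^-1) are assumed parameters.
    """

    def __init__(self, n_features, gamma=0.99, beta=1.0):
        self.gamma = gamma
        self.theta = np.zeros(n_features)      # value-function weights
        self.P = np.eye(n_features) / beta     # running inverse of A + beta*I

    def update(self, phi_s, r, phi_s_next):
        # delta_phi plays the role of phi(s) - gamma * phi(s')
        delta_phi = phi_s - self.gamma * phi_s_next
        P_phi = self.P @ phi_s
        k = P_phi / (1.0 + delta_phi @ P_phi)             # gain vector (Sherman-Morrison)
        self.theta += k * (r - delta_phi @ self.theta)    # TD-error-driven weight update
        self.P -= np.outer(k, delta_phi @ self.P)         # rank-1 correction of P

    def value(self, phi_s):
        return float(phi_s @ self.theta)
```

Each call to `update` costs O(n^2) in the number of features, instead of the O(n^3) that an explicit matrix inversion at every step would require.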

Highlights

  • Least-squares temporal difference (LSTD) learning may be the most popular approach for policy evaluation in reinforcement learning (RL) [1, 2]

  • We propose two online sparse kernel RLSTD (SKRLSTD) algorithms with L2 and L1 regularization, called OSKRLSTD-L2 and OSKRLSTD-L1, respectively

  • A Markov decision process (MDP) can be defined as a tuple M = ⟨S, A, P, r, γ, d⟩ [5], where S is a set of states, A is a set of actions, P : S × A × S → [0, 1] is a state transition probability function where P(s, a, s′) denotes the probability of transitioning to state s′ when taking action a in state s, r is a reward function, γ ∈ [0, 1] is a discount factor, and d is an initial state distribution
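For reference, the quantity that policy evaluation targets in such an MDP under a fixed policy π is the state-value function. The following is a standard statement of it (written here with a reward of the form r(s, a, s′) for concreteness; this is not copied from the paper):

\[
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right],
\qquad
V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P(s, a, s') \left[ r(s, a, s') + \gamma V^{\pi}(s') \right].
\]

LSTD-style methods approximate V^π with a (kernel-induced) linear function and solve this fixed-point condition in a least-squares sense.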


Summary

Introduction

Least-squares temporal difference (LSTD) learning may be the most popular approach for policy evaluation in reinforcement learning (RL) [1, 2]. When the number of features is larger than the number of training samples, LSTD is prone to overfitting. To overcome this problem, Kolter and Ng proposed an L1-regularized LSTD algorithm called LARS-TD for feature selection [5], but it is only applicable to batch learning and its implementation is complicated. Xu proposed a sparse kernel-based LSTD(λ) (SKLSTD(λ)) algorithm with the approximate linear dependence (ALD) criterion [19]. Although this algorithm avoids selecting features manually, it is only applicable to batch learning and its derivation is complicated.
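Because the ALD criterion recurs throughout this line of work, the following minimal Python sketch shows one common form of the ALD test used to grow a kernel dictionary online. The Gaussian kernel, the threshold `nu`, the jitter term, and the function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ald_test(dictionary, x, kernel, nu=0.1):
    """Approximate linear dependence (ALD) test for online sparsification.

    Returns True if phi(x) cannot be approximated (within tolerance `nu`)
    by a linear combination of the feature vectors of the current
    dictionary elements, i.e. if x should be added to the dictionary.
    """
    if not dictionary:
        return True
    K = np.array([[kernel(u, v) for v in dictionary] for u in dictionary])
    k_x = np.array([kernel(u, x) for u in dictionary])
    # Best-approximation coefficients; a small jitter keeps K invertible
    a = np.linalg.solve(K + 1e-8 * np.eye(len(dictionary)), k_x)
    delta = kernel(x, x) - k_x @ a   # squared approximation error in feature space
    return delta > nu

# Usage sketch with an assumed Gaussian kernel on scalar states
rbf = lambda u, v: np.exp(-float(np.sum((np.asarray(u) - np.asarray(v)) ** 2)) / 2.0)
dictionary = []
for s in np.random.randn(30, 1):
    if ald_test(dictionary, s, rbf):
        dictionary.append(s)
print(len(dictionary), "dictionary elements kept out of 30 samples")
```

Keeping only the samples that fail this approximate-linear-dependence check is what lets a kernel LSTD method bound the size of its feature dictionary while still covering newly visited state regions.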

Background
Regularized OSKRLSTD Algorithms
Simulations
Conclusion