Abstract

Traditional reinforcement learning algorithms such as Q-learning, Q(λ), Sarsa, and Sarsa(λ) update the action-value function using the temporal difference (TD) error, which is computed from the most recent action-value function. Starting from this view of the TD error, and to address the low efficiency and slow convergence of the traditional Sarsa(λ) algorithm, this paper defines the nth-order TD error, applies it to the traditional Sarsa(λ) algorithm, and develops a fast Sarsa(λ) algorithm based on the second-order TD error. The algorithm adjusts the Q values with the second-order TD error and broadcasts the TD error over the whole state-action space, which speeds up convergence. The paper also analyzes the convergence rate; under one-step updating, the results show that the number of iterations depends primarily on γ and e. Finally, experiments applying the proposed algorithm to traditional reinforcement learning problems show that it achieves both a faster convergence rate and better convergence performance.

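The abstract does not give the exact form of the second-order TD error or of the broadcast step, so the following is only a minimal tabular sketch under stated assumptions: the second-order TD error is taken to be the one-step TD error recomputed after a provisional first-order update of Q(s, a), and "broadcasting into the whole state-action space" is modeled by applying both errors through eligibility traces over the full Q table. The `env` interface (`n_states`, `n_actions`, `reset`, `step`) is a hypothetical placeholder, not part of the paper.

```python
import numpy as np

def sarsa_lambda_2nd_order(env, episodes=500, alpha=0.1, gamma=0.95,
                           lam=0.9, epsilon=0.1, seed=0):
    """Sketch of tabular Sarsa(lambda) with an assumed second-order TD error."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))

    def eps_greedy(s):
        # Epsilon-greedy action selection over the current Q estimates.
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)          # eligibility traces over the state-action space
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)

            target = r + (0.0 if done else gamma * Q[s2, a2])
            delta1 = target - Q[s, a]                    # first-order TD error

            # Assumed form of the second-order TD error: the residual error that
            # remains after a provisional first-order update of Q(s, a).
            q_provisional = Q[s, a] + alpha * delta1
            delta2 = target - q_provisional

            e[s, a] += 1.0                               # accumulating trace
            # Both errors are spread over all state-action pairs via the traces.
            Q += alpha * (delta1 + delta2) * e
            e *= gamma * lam

            s, a = s2, a2
    return Q
```

Under these assumptions the combined correction `delta1 + delta2 = (2 - alpha) * delta1` acts as a larger effective step toward the same one-step target, which is one way the second-order term could accelerate convergence relative to plain Sarsa(λ).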