Abstract

Reinforcement learning (RL) is the study of programs that improve their performance at some task by receiving rewards and punishments from the environment (Sutton and Barto 1998). RL has been quite successful in the automatic learning of good procedures for complex tasks such as playing Backgammon and scheduling elevators (Tesauro 1992; Crites and Barto 1998). In episodic domains, in which there is a natural termination condition such as the end of a game of Backgammon, the obvious performance measure to optimize is the expected total reward per episode. But some domains, such as elevator scheduling, are recurrent, i.e., they have no natural termination condition. In such cases the total expected reward can be infinite, and we need a different optimization criterion. In the discounted optimization framework, in each time step the value of the reward is multiplied by a discount factor γ < 1, so that the total discounted reward is always finite. However, in many domains there is no natural interpretation for the discount factor γ. A natural performance measure to optimize in such domains is the average reward received per time step. Although one could use a discount factor close to 1 to approximate average-reward optimization, an approach that directly optimizes the average reward avoids this additional parameter and often leads to faster convergence in practice.

There is a significant theory behind average-reward optimization based on Markov decision processes (MDPs) (Puterman 1994). An MDP is described by a 4-tuple ⟨S, A, P, r⟩, where S is a discrete set of states and A is a discrete set of actions. P is a conditional probability distribution over the next states, given the current state and action, and r gives the immediate reward for a given state and action. A policy is a mapping from states to actions. Each policy induces a Markov process over some set of states. In ergodic MDPs, every policy forms a single closed set of states.
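
To make the contrast between the two criteria concrete, here is a minimal sketch (not from the paper) that evaluates a fixed policy on a hypothetical two-state ergodic MDP; all transition probabilities and rewards are invented for illustration. It computes the discounted values V and the average reward ρ, and shows how (1 − γ)V approaches ρ as γ → 1, which is the sense in which a discount factor close to 1 approximates average-reward optimization.

```python
import numpy as np

# Hypothetical two-state ergodic MDP under a fixed policy (numbers invented
# for illustration): P[s, s'] is the transition probability and r[s] the
# expected immediate reward when following the policy in state s.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 5.0])

# Discounted criterion: V satisfies V = r + gamma * P @ V,
# so V = (I - gamma * P)^{-1} r, which is finite for any gamma < 1.
gamma = 0.99
V = np.linalg.solve(np.eye(len(r)) - gamma * P, r)

# Average-reward criterion: the gain rho is the reward weighted by the
# stationary distribution mu, where mu @ P = mu and mu sums to 1
# (mu is the eigenvector of P^T with eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
mu /= mu.sum()
rho = mu @ r

print("discounted values V:", V)
print("average reward rho: %.4f" % rho)     # 7/3 for these numbers
print("(1 - gamma) * V:", (1 - gamma) * V)  # approaches rho as gamma -> 1
```

For these numbers the stationary distribution is (2/3, 1/3), so ρ = 7/3, and at γ = 0.99 the scaled values (1 − γ)V already bracket ρ closely. A directly average-reward method optimizes ρ itself and so never has to choose γ at all.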
