Abstract

Reinforcement learning (RL) is a class of machine learning that adapts an agent to a given environment through rewards and penalties. Traditional RL systems are mainly based on Dynamic Programming (DP) and can obtain an optimal policy that maximizes the expected discounted reward in Markov Decision Processes (MDPs). Temporal Difference (TD) learning (Sutton, 1988) and Q-learning (Watkins, 1992) are well-known DP-based RL systems; they are attractive because they guarantee optimality in MDPs. However, the class of Partially Observable Markov Decision Processes (POMDPs) is wider than that of MDPs, and DP-based RL systems face limitations when applied to POMDPs. Hence, a heuristic eligibility trace is often used to handle a POMDP; TD(λ) (Sutton, 1988), Sarsa(λ) (Singh & Sutton, 1996; Sutton & Barto, 1998), and Actor-Critic (Kimura & Kobayashi, 1998) are RL systems of this kind. A DP-based RL system optimizes its behavior under given reward and penalty values, but it is difficult to design these values appropriately for our purposes; if they are set inappropriately, the agent may learn unexpected behavior (Miyazaki & Kobayashi, 2000). Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000) is a method related to this design problem: given a desired policy as input, an IRL system outputs a reward function under which that policy is realized. IRL has produced several theoretical results, e.g. apprenticeship learning (Abbeel & Ng, 2005) and policy invariance (Ng et al., 1999). On the other hand, we are interested in approaches where a reward and a penalty are treated independently. RL systems that we have proposed from this viewpoint include the rationality theorem of Profit Sharing (PS) (Miyazaki et al., 1994), the Rational Policy Making algorithm (RPM) (Miyazaki & Kobayashi, 1998), and PS-r* (Miyazaki & Kobayashi, 2003); these are restricted to environments with only one type of reward. Furthermore, the Penalty Avoiding Rational Policy Making algorithm (PARP) (Miyazaki & Kobayashi, 2000) and Penalty Avoiding Profit Sharing (PAPS) (Miyazaki et al., 2002) are RL systems that can also handle a penalty. We call these systems Exploitation-oriented Learning (XoL).
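
As a reminder, and in our own notation rather than the chapter's, the expected discounted reward maximized by DP-based methods and the standard Q-learning update of Watkins can be sketched as follows (α is a learning rate, γ a discount factor):

  V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_{0} = s\right], \qquad 0 \le \gamma < 1

  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]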
