Abstract

Reinforcement Learning (RL) is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment (Kaelbling et al., 1996). At each time step, the environment is assumed to be in exactly one state. In Markov Decision Processes (MDPs), all states are fully observable, so the agent can choose a good action based only on its current sensory observation. In Partially Observable Markov Decision Processes (POMDPs), states may be hidden: the current observation alone is insufficient, and the agent must remember past sensations to select a good action. Q-learning is the most popular algorithm for learning from delayed reinforcement in MDPs, and RL with a Recurrent Neural Network (RNN) can solve deep POMDPs. Several methods have been proposed to speed up learning in MDPs by discovering useful subgoals (Girgin et al., 2006; McGovern & Barto, 2001; Menache et al., 2002; Simsek & Barto, 2005). Subgoals are states that have a high reward gradient, that are visited frequently on successful trajectories but not on unsuccessful ones, or that lie between densely connected regions of the state space. In MDPs, a subgoal can be attained with a plain table-based policy, called a skill. These useful skills are then treated as options or macro-actions in RL (Barto & Mahadevan, 2003; Girgin et al., 2006; McGovern & Barto, 2001; Menache et al., 2002; Simsek & Barto, 2005; Sutton et al., 1999). For example, an option named “going to the door” moves a robot from any random position in a hall to one of two doors. However, it is difficult to apply this approach directly when an RNN is used to predict Q values. Simply adding one more unit to the output layer to predict Q values for an option does not work, because updating any connection weight affects all previously learned Q values, and because those Q values are easily lost when the option cannot be executed for a long time. This chapter presents a method, Reinforcement Learning using Automatic Discovery of Subgoals, that carries this approach over to POMDPs. Existing algorithms can be reused to discover subgoals. To obtain a skill, a new RNN-based policy is trained by experience replay. Once useful skills have been obtained, the learned RNNs are integrated into the main RNN as experts. Experimental results on two problems, the E maze problem and the virtual office problem, show that the proposed method enables an agent to acquire a policy as good as the one acquired by RL with an RNN, with better learning performance.
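
To make the options/macro-action idea mentioned above concrete, the following is a minimal sketch of tabular SMDP Q-learning in the spirit of Sutton et al. (1999), where one hand-coded "go to the door"-style skill is selected alongside primitive actions. The grid world, the skill's internal policy, and all hyperparameters are assumptions made for this illustration only; the chapter's actual method replaces the tables with RNNs and operates in POMDPs, which this sketch does not cover.

```python
# Illustrative sketch only: tabular SMDP Q-learning with one hand-coded skill
# ("go to the door" style) treated as a macro-action, following the options
# framework of Sutton et al. (1999).  The grid world, the skill's policy, and
# all hyperparameters below are assumptions made for this example; they are
# not taken from the chapter, whose method uses RNNs in POMDPs instead.
import random

N = 5                          # assumed 5x5 grid world, fully observable
GOAL = (N - 1, N - 1)          # goal ("door") in the top-right corner
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """One primitive transition: move, clip to the grid, reward 1 only at the goal."""
    dx, dy = MOVES[action]
    nxt = (min(max(state[0] + dx, 0), N - 1), min(max(state[1] + dy, 0), N - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def door_skill(state):
    """Hand-coded skill: walk right until the east wall (or the goal) is reached.
    Returns the resulting state, the discounted reward collected inside the
    option, the option's duration k, and a terminal flag."""
    total, k, done = 0.0, 0, False
    while state[0] < N - 1 and not done:
        state, r, done = step(state, "right")
        total += (GAMMA ** k) * r
        k += 1
    return state, total, max(k, 1), done

OPTIONS = ACTIONS + ["door_skill"]      # primitive actions plus one macro-action
Q = {(x, y): {o: 0.0 for o in OPTIONS} for x in range(N) for y in range(N)}

for episode in range(500):
    s, done = (0, 0), False
    while not done:
        # epsilon-greedy choice over primitives and the skill
        o = (random.choice(OPTIONS) if random.random() < EPSILON
             else max(Q[s], key=Q[s].get))
        if o == "door_skill":
            s2, r, k, done = door_skill(s)
        else:
            s2, r, done = step(s, o)
            k = 1
        # SMDP Q-learning update: a k-step option is discounted by gamma**k
        target = r + (0.0 if done else (GAMMA ** k) * max(Q[s2].values()))
        Q[s][o] += ALPHA * (target - Q[s][o])
        s = s2
```

The gamma**k discount for a k-step option is what distinguishes this update from ordinary one-step Q-learning; it is also the step that becomes problematic once a single RNN must predict the Q values, which is the difficulty the chapter addresses.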
