Abstract

A response generating system can be seen as a mapping from a set of external states (inputs) to a set of actions (outputs). This mapping can be done in fundamentally different ways. One method is to divide the state space into a set of discrete states and store the optimal response for each state. This is termed a memory mapping system. Another method is to approximate continuous functions from the input space to the output space. I term this method projective mapping, although the function does not have to be linear. The latter method is the most common one in feedforward neural networks, where an input vector is projected onto a “weight vector”.

In reality this mapping is piecewise continuous. Consider the task of moving from one point to another. There is often an infinite number of solutions, and if two different paths are selected, it is often possible to interpolate between them and obtain a new path that lies somewhere in between. On the other hand, if you are, for instance, passing a tree, you can choose to walk on the right side or the left side of the tree, but nothing is gained by interpolating these two choices. At the same time, there is an infinite number of ways to move on either side of the tree. This implies that there are two different kinds of responses: one that can be interpolated and one that cannot.

In this paper, the discrete set of alternative responses between which interpolation is meaningless is defined as a set of strategies, and the continuous range of possible outputs within each strategy is defined as its actions. In the previous example, the choice of which side of the tree to pass on is the choice between the two strategies, left and right, and the decision of which specific path to walk is a choice of a sequence of actions.

The input space is divided into a finite number of regions, where for each region there is one best choice of strategy and the input-output transition function varies smoothly. The total input-output transition function of the system is thus a set of continuous input-output transition functions, one for each strategy.

The problem of learning the total input-output transition function can now be divided into several learning problems at two different levels. At the first level, there are a number of smooth, continuous functions to be learned, one for each strategy. For a given strategy, an “ordinary” reinforcement learning problem is faced, where projective mapping can be used. At the second level, the learning task is to choose the best strategy from the set. The number of strategies is, however, finite, so at this level a memory mapping method can be efficient.

This paper describes how adaptive critic methods can be used to handle such a two-level reinforcement learning problem. In adaptive critic methods, predictions of future reinforcement are made, and these predictions are used to select a strategy. For each strategy, a prediction of future reinforcement and an action are calculated. The strategy with the highest prediction is then selected, and the corresponding action is used as output. The elements of the reinforcement association vectors and the action association vectors are updated only for the chosen strategy, and hence the internal reinforcement signal depends on which strategy was chosen. In this way the structural credit assignment problem is reduced considerably.
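As a concrete illustration only (not the paper's actual algorithm), a minimal sketch of this two-level scheme might look as follows. The class name, the linear critics and actors, the TD(0)-style internal reinforcement signal, and the Gaussian exploration noise are all assumptions made for the sketch:

```python
import numpy as np

class TwoLevelActorCritic:
    """Hypothetical sketch: one linear critic and one linear actor per strategy.
    Only the chosen strategy's vectors are updated, so the internal
    reinforcement signal depends on which strategy was selected."""

    def __init__(self, n_strategies, n_inputs, lr=0.05, gamma=0.9, noise=0.3):
        self.v = np.zeros((n_strategies, n_inputs))  # reinforcement association vectors
        self.w = np.zeros((n_strategies, n_inputs))  # action association vectors
        self.lr, self.gamma, self.noise = lr, gamma, noise

    def act(self, x):
        preds = self.v @ x                    # predicted future reinforcement per strategy
        s = int(np.argmax(preds))             # second level: discrete strategy choice
        a = float(self.w[s] @ x) + self.noise * np.random.randn()  # first level: continuous action
        return s, a, preds[s]

    def update(self, s, x, a, r, x_next):
        # TD(0)-style internal reinforcement, computed for the chosen strategy only
        p = self.v[s] @ x
        p_next = np.max(self.v @ x_next)      # value of the best strategy in the next state
        delta = r + self.gamma * p_next - p
        self.v[s] += self.lr * delta * x
        # reinforce the explored perturbation in proportion to the internal signal
        self.w[s] += self.lr * delta * (a - self.w[s] @ x) * x
```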
An example of an algorithm that solves a two-level learning task for a one-dimensional dynamic problem is presented. The system uses two strategies, each containing a linear action function. The choice of strategy and the appropriate actions are learned simultaneously by reinforcement learning.
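The abstract does not specify the benchmark, so the following usage sketch invents a one-dimensional task with two goals to show how the hypothetical class above could be exercised; the dynamics, reward, and all constants are assumptions, not the paper's experiment:

```python
# Invented 1-D task: two goals at -1 and +1; reward grows as the state
# approaches the nearer goal, so each strategy can specialize on one side.
agent = TwoLevelActorCritic(n_strategies=2, n_inputs=2)  # features: (x, bias)
rng = np.random.default_rng(0)

for episode in range(500):
    x = rng.uniform(-0.5, 0.5)                 # start between the goals
    for t in range(20):
        feat = np.array([x, 1.0])
        s, a, _ = agent.act(feat)
        x_next = x + 0.1 * float(np.clip(a, -1.0, 1.0))      # simple dynamics
        r = 1.0 - min(abs(x_next - 1.0), abs(x_next + 1.0))  # nearer-goal reward
        agent.update(s, feat, a, r, np.array([x_next, 1.0]))
        x = x_next
```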
