Abstract

is a key problem. The approaches mentioned above are built on decomposing the state space. In these approaches, the state space is decomposed according to prior domain knowledge supplied with human assistance, which makes the decomposition problem-specific and undermines the agent's autonomy, the chief merit of RL. In this work, we propose a heuristic method that isolates a large state space into several smaller state spaces in order to decompose the learning task. We record every training episode and remove its state loops to obtain an acyclic episode. We found that some states appear in most acyclic episodes, and some appear in all of them; that is, these states have a high probability (up to 1) of appearing in an acyclic episode. Some of these states are also gates through which the agent moves from one part of the state space into another. Some states possess both characteristics simultaneously, which means that passing through them is critical if the agent is to reach the goal state. We call these critical states gate states. This situation is common in the real world: for example, the state "darkly clouded" often appears before the state "downpour". When these gate states are blocked, the whole state space is naturally isolated into several smaller state spaces. Our method is not very sophisticated, but it is effective: once the target state space is decomposed, learning clearly improves. More importantly, our work gives the agent the capability to decompose its learning task according to what it has learned, which preserves the agent's autonomy. We cannot guarantee that the method isolates the state space completely, that is, that it finds and blocks every gate state, because it is built on the learned episodes. When training episodes are scarce, the agent may "escape" from one small state space into another through an undiscovered gate state. However, experiments on the GridWorld problem show that the isolation tends toward completeness as the number of training episodes increases.
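
The loop-removal and frequency-counting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the rule used to cut a loop (on revisiting a state, everything since its first visit is dropped), the `threshold` value, and the option to exclude start and goal states are all assumptions made for illustration. The sketch covers only the frequency criterion; it does not check whether a candidate state actually connects two parts of the state space.

```python
from collections import Counter


def remove_loops(episode):
    """Return the episode with state loops cut out (assumed rule: when a
    state is revisited, drop everything visited since its first occurrence)."""
    acyclic = []
    position = {}  # state -> index in `acyclic`
    for state in episode:
        if state in position:
            cut = position[state]
            # Forget the states that formed the loop.
            for dropped in acyclic[cut + 1:]:
                del position[dropped]
            del acyclic[cut + 1:]
        else:
            position[state] = len(acyclic)
            acyclic.append(state)
    return acyclic


def appearance_probability(episodes):
    """Fraction of acyclic episodes in which each state appears."""
    counts = Counter()
    for episode in episodes:
        counts.update(set(remove_loops(episode)))
    total = len(episodes)
    return {state: n / total for state, n in counts.items()}


def candidate_gate_states(episodes, threshold=0.9, exclude=()):
    """States whose appearance probability reaches `threshold`
    (hypothetical cutoff), ignoring states listed in `exclude`."""
    probs = appearance_probability(episodes)
    return {s for s, p in probs.items() if p >= threshold and s not in exclude}


# Toy example: two "rooms" joined by a single door state "D".
episodes = [
    ["s", "a", "b", "a", "D", "x", "g"],
    ["s", "b", "D", "y", "x", "y", "g"],
    ["s", "a", "D", "y", "g"],
]
gates = candidate_gate_states(episodes, threshold=0.9, exclude={"s", "g"})
print(gates)  # {'D'}
```

The start and goal states trivially appear in every successful episode, so the example excludes them by hand. With few episodes the estimate is coarse, which matches the caveat above that some gate states may remain undiscovered until more training episodes are collected.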
