Abstract
Learning in reinforcement learning is time-consuming because, in early episodes, the agent relies heavily on exploration. The proposed "coaching" approach aims to accelerate learning in systems with sparse environmental rewards. The approach works well with linear $\epsilon$-greedy Q-learning with eligibility traces. To coach an agent, a human coach gives it an intermediate target as a sub-goal to pursue. This sub-goal provides an additional clue that guides the agent toward the actual terminal state. In the coaching phase, the agent pursues the intermediate target with an aggressive policy. The aggressive reward from this intermediate target is not used to update the state-action values directly; the environmental reward is used instead. After a small number of coaching episodes, learning proceeds normally with an $\epsilon$-greedy policy. In this way, the agent ends up with an optimal policy that is not under the influence or supervision of a human coach. The proposed method has been tested on three experimental tasks: mountain car, ball following, and obstacle avoidance. Even with human coaches of various skill levels, the experimental results show that the method speeds up the agent's learning in all tasks.
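The coaching schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 12-state chain environment, the fixed sub-goal, the hyperparameter values, and the helper names (`coach_action`, `epsilon_greedy`, `update`, `train`) are all assumptions made for illustration. The value function is linear over one-hot state features, which reduces to a table, and the update is a naive Q($\lambda$) with replacing traces that are not cut after exploratory actions. During coaching episodes the agent steps straight toward the coach's sub-goal, yet every update uses only the environmental reward.

```python
import random

# Hypothetical toy task: a chain of 12 states, start at 0, sparse reward only at GOAL.
N_STATES, GOAL = 12, 11
ALPHA, GAMMA, LAM, EPS = 0.5, 0.95, 0.8, 0.1  # assumed hyperparameters


def coach_action(state, target):
    """Aggressive coaching policy: always step toward the coach's target."""
    return 1 if state < target else 0  # 1 = right, 0 = left


def epsilon_greedy(q, state, rng):
    """Ordinary epsilon-greedy choice with random tie-breaking."""
    left, right = q[0][state], q[1][state]
    if rng.random() < EPS or left == right:
        return rng.randrange(2)
    return 1 if right > left else 0


def update(q, e, s, a, r, s_next, done):
    """One naive Q(lambda) step; r is the environmental reward only."""
    target = r if done else r + GAMMA * max(q[0][s_next], q[1][s_next])
    delta = target - q[a][s]
    e[a][s] = 1.0  # replacing trace for the visited state-action pair
    for act in range(2):
        for st in range(N_STATES):
            q[act][st] += ALPHA * delta * e[act][st]
            e[act][st] *= GAMMA * LAM  # decay all traces


def train(subgoal=8, coach_episodes=3, total_episodes=30, max_steps=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_STATES for _ in range(2)]  # q[action][state]
    lengths = []
    for ep in range(total_episodes):
        e = [[0.0] * N_STATES for _ in range(2)]
        s, steps, reached_subgoal = 0, 0, False
        while s != GOAL and steps < max_steps:
            reached_subgoal = reached_subgoal or s == subgoal
            if ep < coach_episodes and not reached_subgoal:
                a = coach_action(s, subgoal)   # coaching phase
            else:
                a = epsilon_greedy(q, s, rng)  # normal learning
            s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
            r = 1.0 if s_next == GOAL else 0.0  # no reward for reaching the sub-goal
            update(q, e, s, a, r, s_next, s_next == GOAL)
            s, steps = s_next, steps + 1
        lengths.append(steps)
    return q, lengths
```

Calling `train()` returns the learned values and the number of steps taken in each episode; setting `coach_episodes=0` gives an uncoached baseline on the same toy chain for comparison.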