Abstract

The objective of this paper is to accelerate the process of policy improvement in reinforcement learning. The proposed Dyna-style system combines two learning schemes, one of which utilizes a temporal difference method for direct learning; the other uses relative values for indirect learning in planning between two successive direct learning cycles. Instead of establishing a complicated world model, the approach introduces a simple predictor of average rewards to actor-critic architecture in the simulation (planning) mode. The relative value of a state, defined as the accumulated differences between immediate reward and average reward, is used to steer the improvement process in the right direction. The proposed learning scheme is applied to control a pendulum system for tracking a desired trajectory to demonstrate its adaptability and robustness. Through reinforcement signals from the environment, the system takes the appropriate action to drive an unknown dynamic to track desired outputs in few learning cycles. Comparisons are made between the proposed model-free method, a connectionist adaptive heuristic critic, and an advanced method of Dyna-Q learning in the experiments of labyrinth exploration. The proposed method outperforms its counterparts in terms of elapsed time and convergence rate.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call