Abstract

Prioritized experience replay is widely used in online reinforcement learning algorithms because it exploits past experiences efficiently. However, a large replay buffer consumes significant system storage. In this paper, a segmentation and classification scheme is therefore proposed. The distribution of temporal-difference (TD) errors is first segmented, and each experience used for network training is classified according to its updated TD error. A swap mechanism for similar experiences is then applied to change the lifetimes of experiences in the replay buffer. The proposed scheme is incorporated into the Deep Deterministic Policy Gradient (DDPG) algorithm and verified on the Inverted Pendulum and Inverted Double Pendulum tasks. The experiments show that the proposed mechanism effectively removes buffer redundancy and further reduces the correlation of experiences in the replay buffer. Better learning performance with a reduced memory size is thus achieved, at the cost of additional computation of updated TD errors.
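To make the idea concrete, below is a minimal sketch of a replay buffer that segments experiences by TD-error magnitude and swaps out an old experience from the same segment when a similar one arrives, rather than growing the buffer or evicting purely by age. This is an illustration of the general mechanism only, not the authors' implementation: the bin edges, per-segment capacity, swap rule, and uniform sampling are all assumptions made for the example.

```python
import random
from collections import deque


class SegmentedReplayBuffer:
    """Illustrative replay buffer: segment by TD-error magnitude, swap within segments."""

    def __init__(self, capacity_per_bin=1000, bin_edges=(0.1, 0.5, 1.0)):
        # bin_edges segments the TD-error distribution (assumed values, not from the paper).
        self.bin_edges = bin_edges
        self.bins = [deque(maxlen=capacity_per_bin)  # one FIFO per TD-error segment
                     for _ in range(len(bin_edges) + 1)]

    def _bin_index(self, td_error):
        # Classify an experience by the magnitude of its (updated) TD error.
        magnitude = abs(td_error)
        for i, edge in enumerate(self.bin_edges):
            if magnitude < edge:
                return i
        return len(self.bin_edges)

    def add(self, experience, td_error):
        idx = self._bin_index(td_error)
        bucket = self.bins[idx]
        if len(bucket) == bucket.maxlen:
            # Swap: drop the oldest "similar" experience (same TD-error segment)
            # so redundant, low-novelty samples do not accumulate.
            bucket.popleft()
        bucket.append((experience, td_error))

    def sample(self, batch_size):
        # Uniform sampling over all stored experiences; a prioritized scheme
        # would instead weight segments by their TD errors.
        pool = [item for bucket in self.bins for item in bucket]
        return random.sample(pool, min(batch_size, len(pool)))
```

In a DDPG training loop, `add` would be called with the TD error recomputed after each critic update, so an experience can migrate conceptually between segments as its TD error changes.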
