Abstract

This paper aims to address the sample inefficiency of Asynchronous Advantage Actor-Critic (A3C). First, we design a new off-policy actor-critic algorithm that combines the actor-critic architecture with experience replay to improve sample efficiency. Next, we study how trajectory experience should be sampled from the replay buffer and propose a familiarity-based replay mechanism that uses the number of times an experience has been replayed as its sampling probability weight. Finally, we use the GAE-V method to correct the bias introduced by off-policy learning. We further improve performance by adopting an update scheme that combines off-policy and on-policy learning. Our results on the Atari and MuJoCo benchmarks show that each of these innovations contributes to improvements in both data efficiency and final performance. Furthermore, our approach retains the fast convergence speed and parallel structure of A3C, and also achieves better exploration.
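To make the familiarity-based replay mechanism concrete, the sketch below shows one possible instantiation in Python. The class name and the specific inverse weighting 1 / (1 + replay_count) are illustrative assumptions: the abstract only states that the number of replay times is used as the sampling probability weight, so the exact weighting function used in the paper may differ.

import random
from collections import deque

class FamiliarityReplayBuffer:
    """Trajectory replay buffer whose sampling weights depend on replay counts.

    Hypothetical sketch: each stored trajectory tracks how many times it has
    been replayed, and less-replayed ("less familiar") trajectories are
    assumed to be sampled with higher probability.
    """

    def __init__(self, capacity=1000):
        # Both deques share the same capacity, so trajectories and their
        # replay counts are evicted together when the buffer is full.
        self.trajectories = deque(maxlen=capacity)
        self.replay_counts = deque(maxlen=capacity)

    def add(self, trajectory):
        # A trajectory is e.g. a list of (state, action, reward, behaviour_prob)
        # tuples collected by one actor; it starts with a replay count of zero.
        self.trajectories.append(trajectory)
        self.replay_counts.append(0)

    def sample(self):
        # Weight each trajectory by 1 / (1 + replay count), so experiences that
        # have been replayed fewer times are more likely to be drawn.
        weights = [1.0 / (1 + c) for c in self.replay_counts]
        idx = random.choices(range(len(self.trajectories)), weights=weights, k=1)[0]
        self.replay_counts[idx] += 1
        return self.trajectories[idx]

The sampled trajectory would then feed the off-policy update (with GAE-V bias correction), while freshly collected on-policy trajectories drive the on-policy part of the combined update.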
