Abstract

CEM-TD3 combines the simple cross-entropy method (CEM) with Twin Delayed Deep Deterministic policy gradient (TD3) and achieves a satisfactory trade-off between performance and sample efficiency. However, we find that CEM-TD3 does not fully resolve the low efficiency of policy search caused by CEM, and that the policy-gradient learning introduced by TD3 weakens the diversity of individuals in the population. In this paper, we propose Double Buffers CEM-TD3 (DBCEM-TD3), which improves both the CEM and TD3 components. For CEM, DBCEM-TD3 maintains an actor buffer that stores the population required for evolution. In each iteration, it only needs to generate a small number of actors to replace the worst actors in the actor buffer, yielding more efficient evolution. The fitness of individuals in the actor buffer decays exponentially over time, which helps avoid premature convergence of the mean actor. For TD3, DBCEM-TD3 maintains a critic buffer containing as many critics as the actors generated in each iteration, and each critic is trained independently on samples from the shared replay buffer. In each iteration, each newly generated actor is guided by a different critic. This encourages more diverse behaviors among the learned actors, so richer experiences are collected during the evaluation phase. We evaluate DBCEM-TD3 on five continuous control tasks provided by OpenAI Gym; it outperforms CEM-TD3, TD3, and other classic off-policy reinforcement learning algorithms in terms of both performance and sample efficiency.
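
To make the double-buffer bookkeeping concrete, the following Python sketch illustrates the loop structure the abstract describes: an actor buffer whose stored fitness values decay exponentially and whose worst members are replaced by a few newly generated actors, and a critic buffer with one independently trained critic per new actor, all drawing from a shared replay buffer. The names (ActorBuffer, DECAY, NUM_NEW, the stubbed training and evaluation functions) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the DBCEM-TD3 double-buffer loop described in the abstract.
# All hyperparameter names and helper functions below are assumptions for
# illustration; actor/critic updates and the CEM update are stubbed out.
import random

DECAY = 0.9        # assumed exponential decay factor for stored fitness
POP_SIZE = 10      # actors kept in the actor buffer
NUM_NEW = 3        # actors (and critics) generated per iteration


class ActorBuffer:
    """Holds (actor_params, fitness) pairs; fitness decays each iteration so
    stale high scores cannot dominate the CEM mean indefinitely."""

    def __init__(self):
        self.entries = []  # list of [actor_params, fitness]

    def decay(self):
        for entry in self.entries:
            entry[1] *= DECAY

    def replace_worst(self, new_actors_with_fitness):
        # Add the newcomers, then keep only the best POP_SIZE individuals.
        self.entries.extend(list(e) for e in new_actors_with_fitness)
        self.entries.sort(key=lambda e: e[1], reverse=True)
        self.entries = self.entries[:POP_SIZE]

    def elites(self, k):
        return [actor for actor, _ in self.entries[:k]]


def cem_mean(elite_actors):
    # Placeholder CEM update: average the elite parameter vectors.
    dim = len(elite_actors[0])
    return [sum(a[i] for a in elite_actors) / len(elite_actors) for i in range(dim)]


def train_critic(critic, replay_buffer):
    pass  # stand-in for independent TD3-style critic updates


def train_actor(actor, critic):
    return actor  # stand-in for policy-gradient steps guided by one critic


def evaluate(actor):
    return random.random()  # stand-in for an episode return


# One iteration of the double-buffer loop.
actor_buffer = ActorBuffer()
actor_buffer.replace_worst([([random.gauss(0, 1) for _ in range(4)], 0.0)
                            for _ in range(POP_SIZE)])
critic_buffer = [object() for _ in range(NUM_NEW)]   # one critic per new actor
replay_buffer = []                                    # shared by all critics

actor_buffer.decay()
mean = cem_mean(actor_buffer.elites(POP_SIZE // 2))
new_actors = [[m + random.gauss(0, 0.1) for m in mean] for _ in range(NUM_NEW)]
for actor, critic in zip(new_actors, critic_buffer):
    train_critic(critic, replay_buffer)   # each critic trained independently
    train_actor(actor, critic)            # each new actor follows its own critic
scored = [(a, evaluate(a)) for a in new_actors]
actor_buffer.replace_worst(scored)        # only the few new actors are inserted
```

Because only NUM_NEW actors are generated and evaluated per iteration while the rest of the population persists in the buffer, the evolutionary step reuses past evaluations rather than regenerating a full population each time.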
