Abstract

Evolutionary strategies (ES) and off-policy learning algorithms are two major workhorses of reinforcement learning (RL): ES takes a simple black-box approach to optimization but tends to be less sample efficient, whereas off-policy learning is more sample efficient by design but its updates can be unstable. Motivated by these trade-offs, we propose CEM-ACER, which combines the cross-entropy method (CEM), a standard ES algorithm, with actor-critic with experience replay (ACER), an off-policy actor-critic algorithm. Our proposal relies on a key insight: off-policy algorithms provide a natural mechanism for efficiently evolving parameter populations as part of an ES algorithm. Across a wide range of benchmark control tasks, we show that CEM-ACER balances the strengths of CEM and ACER, yielding an algorithm that consistently outperforms its individual building blocks as well as other competitive baseline algorithms.
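
For readers unfamiliar with the ES building block, the following is a minimal sketch of the plain cross-entropy method on a toy objective, not the CEM-ACER algorithm itself. The function names, the hyperparameters, the toy objective, and the optional `refine` hook are all illustrative assumptions; the hook only marks one place where off-policy updates could plausibly act on population members, and the abstract does not specify how the combination is actually done.

```python
import random
import statistics


def cem_maximize(score, dim, iterations=50, pop_size=40, elite_frac=0.25,
                 init_std=2.0, min_std=1e-3, refine=None, seed=0):
    """Cross-entropy method with a diagonal Gaussian search distribution."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(pop_size * elite_frac))

    for _ in range(iterations):
        # Sample a population of parameter vectors around the current mean.
        population = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop_size)]

        # Hypothetical hook (an assumption, not taken from the paper): a place
        # where off-policy updates could improve members before evaluation.
        if refine is not None:
            population = [refine(member) for member in population]

        # Keep the highest-scoring members.
        elites = sorted(population, key=score, reverse=True)[:n_elite]

        # Refit the Gaussian to the elites, one coordinate at a time.
        columns = list(zip(*elites))
        mean = [statistics.fmean(col) for col in columns]
        std = [max(statistics.pstdev(col), min_std) for col in columns]

    return mean


def toy_return(theta):
    # Stand-in for an episode return; maximized at theta = (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


best = cem_maximize(toy_return, dim=2)
```

In an RL setting, `score` would be replaced by the return of a policy with parameters `theta`, and each call would cost at least one environment rollout. That cost is the sample inefficiency the abstract attributes to ES.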
