Abstract
Even with the same algorithm and hyperparameter configuration, deep reinforcement learning (DRL) produces drastically different results across multiple experimental trials, and most of these results are unsatisfactory. Because of this instability, researchers have to run many trials to validate an algorithm or a set of hyperparameters in DRL. In this article, we present the policy return method, a new design for reducing the number of trials required to train a DRL model. This method allows the learned policy to return to a previous state when it becomes divergent or stagnant at any stage of training. When returning, a certain percentage of stochastic data is added to the weights of the neural networks to prevent a repeated decline. Extensive experiments on challenging tasks and various target scores demonstrate that the policy return method reduces the required number of trials by about 10% to 40% compared with the corresponding original algorithms, and by 10% to 30% compared with state-of-the-art algorithms.
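Read literally, the abstract describes a checkpoint-and-perturb loop: keep the best-performing policy seen so far, and when learning diverges or stalls, restore that checkpoint with a small amount of random noise mixed into the weights. Below is a minimal Python sketch of that reading; the class name PolicyReturnMonitor, the 100-episode window, the divergence and stagnation thresholds, and the 5% noise ratio are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the policy-return idea described in the abstract.
# All names and numeric thresholds here are assumptions for illustration.
import copy
import numpy as np


class PolicyReturnMonitor:
    def __init__(self, noise_ratio=0.05, window=100, stall_tolerance=1e-3):
        self.noise_ratio = noise_ratio      # fraction of stochastic data mixed into weights
        self.window = window                # episodes used to judge progress
        self.stall_tolerance = stall_tolerance
        self.best_score = -np.inf
        self.best_weights = None            # checkpoint the policy can return to
        self.scores = []

    def update(self, weights, episode_score):
        """Record progress and checkpoint the best-performing weights so far."""
        self.scores.append(episode_score)
        mean_score = np.mean(self.scores[-self.window:])
        if mean_score > self.best_score:
            self.best_score = mean_score
            self.best_weights = copy.deepcopy(weights)

    def should_return(self):
        """Heuristic check for divergence or stagnation over the last window."""
        if len(self.scores) < 2 * self.window:
            return False
        recent = np.mean(self.scores[-self.window:])
        diverged = recent < 0.5 * self.best_score               # sharp performance drop
        stalled = abs(recent - self.best_score) < self.stall_tolerance  # no progress
        return diverged or stalled

    def returned_weights(self, rng=None):
        """Return the checkpointed weights with a small random perturbation
        added, so that training does not repeat the same decline."""
        rng = rng or np.random.default_rng()
        perturbed = {}
        for name, w in self.best_weights.items():
            noise = rng.standard_normal(w.shape) * self.noise_ratio * np.abs(w)
            perturbed[name] = w + noise
        return perturbed
```

In use, a training loop would call update() after each episode, query should_return() periodically, and load returned_weights() into the policy network whenever it answers true.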
Highlights
Deep reinforcement learning is a combination of deep learning and reinforcement learning, enabling it to possess both the perception ability of deep learning and the policy-making ability of reinforcement learning [1], [2]
Nr is estimated by Nr = 50/w, where w is the number of trials whose best 100-episode performance surpasses the target score (a worked example follows these highlights)
As demonstrated by the results in Fig. 8, the policy return method can help deep deterministic policy gradient (DDPG) avoid divergence during the learning process
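As a quick illustration of the Nr = 50/w estimate in the highlight above, here is a short Python sketch; it assumes w is counted over 50 independent trials, so Nr approximates how many trials are needed on average to obtain one success.

```python
# Worked example of the Nr = 50 / w estimate quoted in the highlights.
# Assumption of this sketch: w is the number of successful trials out of 50 runs.
def estimate_required_trials(best_scores, target):
    """best_scores: best 100-episode average of each trial; target: target score."""
    w = sum(score >= target for score in best_scores)
    if w == 0:
        raise ValueError("No trial reached the target score; the estimate is undefined.")
    return len(best_scores) / w  # equals 50 / w when 50 trials are run


# 2 of these 50 hypothetical trials reach the target of 3000, so Nr = 50 / 2 = 25.
scores = [3200, 2800, 3500] + [1000] * 47
print(estimate_required_trials(scores, target=3000))  # 25.0
```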
Summary
Deep reinforcement learning is a combination of deep learning and reinforcement learning, enabling it to possess both the perception ability of deep learning and the policy-making ability of reinforcement learning [1], [2]. Models based on deep reinforcement learning have achieved excellent performance across various domains, for instance, beating top human players in Go matches [3], [4], controlling the operation of complex machinery [5]–[8], allocating network resources [9], and improving wireless communication technologies [10], [11]. In deep learning, if the training process is performed multiple times, the learning curves acquire similar distributions. These phenomena do not occur with deep reinforcement learning: most often, when using the same algorithm and hyperparameters, the results of multiple training processes differ widely. Even a DRL model that produces an excellent policy could, in all probability, result in a nonconvergent learning curve.