Abstract
Even with the same algorithm and hyperparameter configuration, deep reinforcement learning (DRL) produces drastically different results across multiple experimental trials, and most of these results are unsatisfactory. Because of this instability, researchers have to run many trials to validate an algorithm or a set of hyperparameters in DRL. In this article, we present the policy return method, a new design for reducing the number of trials required to train a DRL model. This method allows the learned policy to return to a previous state when it becomes divergent or stagnant at any stage of training. When returning, a certain percentage of stochastic data is added to the weights of the neural networks to prevent a repeated decline. Extensive experiments on challenging tasks and various target scores demonstrate that the policy return method reduces the required number of trials by about 10% to 40% compared with the corresponding original algorithms, and by 10% to 30% compared with state-of-the-art algorithms.
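Read literally, the abstract describes a checkpoint-and-perturb loop: keep the best-performing policy seen so far, and when learning diverges or stalls, restore that checkpoint with a small amount of random noise mixed into the weights. Below is a minimal Python sketch of that reading; the class name PolicyReturnMonitor, the 100-episode window, the divergence and stagnation thresholds, and the 5% noise ratio are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the policy-return idea described in the abstract.
# All names and numeric thresholds here are assumptions for illustration.
import copy
import numpy as np


class PolicyReturnMonitor:
    def __init__(self, noise_ratio=0.05, window=100, stall_tolerance=1e-3):
        self.noise_ratio = noise_ratio      # fraction of stochastic data mixed into weights
        self.window = window                # episodes used to judge progress
        self.stall_tolerance = stall_tolerance
        self.best_score = -np.inf
        self.best_weights = None            # checkpoint the policy can return to
        self.scores = []

    def update(self, weights, episode_score):
        """Record progress and checkpoint the best-performing weights so far."""
        self.scores.append(episode_score)
        mean_score = np.mean(self.scores[-self.window:])
        if mean_score > self.best_score:
            self.best_score = mean_score
            self.best_weights = copy.deepcopy(weights)

    def should_return(self):
        """Heuristic check for divergence or stagnation over the last window."""
        if len(self.scores) < 2 * self.window:
            return False
        recent = np.mean(self.scores[-self.window:])
        diverged = recent < 0.5 * self.best_score               # sharp performance drop
        stalled = abs(recent - self.best_score) < self.stall_tolerance  # no progress
        return diverged or stalled

    def returned_weights(self, rng=None):
        """Return the checkpointed weights with a small random perturbation
        added, so that training does not repeat the same decline."""
        rng = rng or np.random.default_rng()
        perturbed = {}
        for name, w in self.best_weights.items():
            noise = rng.standard_normal(w.shape) * self.noise_ratio * np.abs(w)
            perturbed[name] = w + noise
        return perturbed
```

In use, a training loop would call update() after each episode, query should_return() periodically, and load returned_weights() into the policy network whenever it answers true.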
Highlights
Deep reinforcement learning is a combination of deep learning and reinforcement learning, enabling it to possess both the perception ability of deep learning and the policy-making ability of reinforcement learning [1], [2]
Nr is estimated by Nr = 50/w, where w is the number of trials whose best 100-episode performance surpasses the target score (a worked example follows these highlights)
As demonstrated by the results in Fig. 8, the policy return method can help deep deterministic policy gradient (DDPG) avoid divergence during the learning process
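As a quick illustration of the Nr = 50/w estimate in the highlight above, here is a short Python sketch; it assumes w is counted over 50 independent trials, so Nr approximates how many trials are needed on average to obtain one success.

```python
# Worked example of the Nr = 50 / w estimate quoted in the highlights.
# Assumption of this sketch: w is the number of successful trials out of 50 runs.
def estimate_required_trials(best_scores, target):
    """best_scores: best 100-episode average of each trial; target: target score."""
    w = sum(score >= target for score in best_scores)
    if w == 0:
        raise ValueError("No trial reached the target score; the estimate is undefined.")
    return len(best_scores) / w  # equals 50 / w when 50 trials are run


# 2 of these 50 hypothetical trials reach the target of 3000, so Nr = 50 / 2 = 25.
scores = [3200, 2800, 3500] + [1000] * 47
print(estimate_required_trials(scores, target=3000))  # 25.0
```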
Summary
Deep reinforcement learning is a combination of deep learning and reinforcement learning, enabling it to possess both the perception ability of deep learning and the policy-making ability of reinforcement learning [1], [2]. Models based on deep reinforcement learning have achieved excellent performance across various domains, for instance, beating top human players in Go matches [3], [4], controlling the operation of complex machinery [5]–[8], allocating network resources [9], and improving wireless communication technologies [10], [11]. In deep learning, if the training process is performed multiple times, the learning curves acquire similar distributions. These phenomena do not occur with deep reinforcement learning: most often, when using the same algorithm and hyperparameters, the results of multiple training processes differ widely. Even a DRL model that produces an excellent policy could, in all probability, result in a nonconvergent learning curve.