An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning.

Wenjia Meng, Qian Zheng, Gang Pan, Yue Shi

doi:10.1109/tnnls.2020.3044196

Abstract

In deep reinforcement learning, off-policy data reduce the amount of on-policy interaction with the environment that is required, and trust region policy optimization (TRPO) is effective at stabilizing policy optimization. In this article, we propose off-policy TRPO, a method that exploits both on- and off-policy data while guaranteeing monotonic policy improvement. We develop a surrogate objective function that uses both kinds of data and preserves this guarantee, and we optimize it by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that off-policy TRPO outperforms other trust-region policy-based methods that use off-policy data on the majority of these tasks.
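To make the general idea of a trust-region update on off-policy samples concrete, the following is a minimal sketch and not the surrogate objective or solver from the paper, whose exact form the abstract does not give. It assumes a single state with a small discrete action set, an importance-weighted surrogate (the ratio of the candidate policy to the behavior policy, times an advantage estimate), and a simple accept/reject test against a KL bound. The names `surrogate`, `kl_categorical`, `accept_step`, and the threshold `delta` are hypothetical, chosen only for illustration.

```python
import math


def surrogate(policy_probs, behavior_probs, advantages):
    """Importance-weighted surrogate: mean of (pi / mu) * A over the samples.

    policy_probs[i]   : probability of sampled action i under the evaluated policy
    behavior_probs[i] : probability of the same action under the behavior policy
    advantages[i]     : advantage estimate for that sample (assumed given)
    """
    total = 0.0
    for p, mu, adv in zip(policy_probs, behavior_probs, advantages):
        total += (p / mu) * adv
    return total / len(advantages)


def kl_categorical(p, q):
    """KL(p || q) for two discrete distributions over the same action set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def accept_step(old_dist, new_dist, actions, behavior_probs, advantages, delta=0.01):
    """Accept the candidate policy only if it stays inside the KL trust region
    around the old policy and improves the surrogate on the stored samples."""
    old_probs = [old_dist[a] for a in actions]
    new_probs = [new_dist[a] for a in actions]
    improved = (surrogate(new_probs, behavior_probs, advantages)
                > surrogate(old_probs, behavior_probs, advantages))
    within_region = kl_categorical(old_dist, new_dist) <= delta
    return improved and within_region


# Illustrative data (made up for this sketch, not from the paper).
old_dist = [0.5, 0.3, 0.2]
actions = [0, 1, 2, 0]                  # actions sampled by the behavior policy
behavior_probs = [0.4, 0.4, 0.2, 0.4]   # behavior-policy probability of each sample
advantages = [1.0, -1.0, 0.0, 1.0]

small_step = [0.52, 0.28, 0.20]  # close to old_dist, improves the surrogate
large_step = [0.90, 0.05, 0.05]  # improves the surrogate but leaves the trust region

print(accept_step(old_dist, small_step, actions, behavior_probs, advantages))
print(accept_step(old_dist, large_step, actions, behavior_probs, advantages))
```

In this toy setting the small step is accepted and the large step is rejected, which mirrors the role a trust-region constraint plays in keeping each update close to the current policy. A practical implementation for continuous control would use parameterized policies and an approximate constrained solver rather than this accept/reject check.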

Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication Date: Jan 22, 2021
Citations: 23

Similar Papers

Authentic Boundary Proximal Policy Optimization.
Yuhu Cheng ... Longyang Huang
IEEE Transactions on Cybernetics | VOL. 52 | 11 Mar 2021

Research on Supply Chain Optimization and Management Based on Deep Reinforcement Learning
Gao Yunxiang ... Wang Zhao
Scalable Computing: Practice and Experience | VOL. 25 | 01 Oct 2024

Q-PrOP: Sample-efficient policy gradient with an off-policy critic
28 Feb 2017

Trust region policy optimization via entropy regularization for Kullback–Leibler divergence constraint
Haotian Xu ... Jie Lu
Neurocomputing | VOL. 589 | 16 Apr 2024
