Abstract

Offline reinforcement learning (RL) aims to learn an optimal policy from a static offline dataset, without interacting with the environment. However, the theoretical understanding of existing offline RL methods requires further study; in particular, the conservatism of the learned Q-function and the learned policy remains a major issue. In this article, we propose a simple and efficient offline RL with relaxed conservatism (ORL-RC) framework that addresses this concern by learning a Q-function that is close to the true Q-function under the learned policy. We analyze the conservatism of the Q-functions and policies learned by offline RL methods, and the analysis shows that this conservatism can degrade policy performance. We establish convergence results for the proposed ORL-RC and derive bounds on the learned Q-function with and without sampling errors, suggesting that the gap between the learned Q-function and the true Q-function can be reduced by executing the conservative policy improvement. We present a practical implementation of ORL-RC, and experimental results on the D4RL benchmark show that it substantially outperforms existing state-of-the-art offline RL methods.
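
To make the idea of relaxed conservatism concrete, below is a minimal, hypothetical sketch of a conservative Q-function update on an offline batch, written in PyTorch. The QNetwork architecture, the policy interface, and the penalty weight alpha are illustrative assumptions and not the paper's exact ORL-RC objective; the sketch only shows, generically, how a conservatism penalty can be scaled down so the learned Q-function stays closer to the true Q-function under the learned policy.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Simple MLP Q-network: Q(s, a) -> scalar (illustrative architecture)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def conservative_q_loss(q_net, target_q_net, policy, batch,
                        gamma=0.99, alpha=1.0):
    """One illustrative conservative Q update on an offline batch.

    `alpha` scales the conservatism penalty; relaxing conservatism
    corresponds to reducing this weight so the learned Q-function is
    pushed down less on policy actions (hypothetical coefficient, not
    the paper's formulation).
    """
    s, a, r, s_next, done = batch  # tensors sampled from the static dataset

    # Standard TD target computed with the learned policy's next action.
    with torch.no_grad():
        a_next = policy(s_next)
        td_target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
    td_loss = ((q_net(s, a) - td_target) ** 2).mean()

    # Conservatism penalty: push Q down on (possibly out-of-distribution)
    # policy actions and up on dataset actions.
    a_pi = policy(s).detach()
    penalty = (q_net(s, a_pi) - q_net(s, a)).mean()

    return td_loss + alpha * penalty
```

In this sketch, a large alpha recovers the strongly conservative regime, while a smaller alpha relaxes the penalty and lets the learned Q-function track the true Q-function under the learned policy more closely.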
