Abstract
Offline reinforcement learning (RL) aims to learn policies from static datasets. Value overestimation on out-of-distribution (OOD) actions makes it difficult to apply general RL methods directly in the offline setting. To overcome this problem, many works focus on estimating the value function conservatively or pessimistically. However, existing methods require additional OOD sampling or uncertainty estimation to underestimate OOD values, making them complex and sensitive to hyperparameters. Is it possible to design a value function that is automatically conservative on OOD samples? In this study, we reveal that the widely used ReLU network is anti-conservative under certain conditions and explain the reason theoretically. Based on this analysis, we propose a novel neural network architecture that pushes down the values of samples far from the dataset; we call this architecture the Conservative Network (ConsNet). Building on ConsNet, we propose a new offline RL algorithm that is simple to implement and achieves strong performance. Because ConsNet provides additional conservatism by itself, integrating it into several existing offline RL methods significantly improves their performance or reduces their complexity. Given its simplicity and effectiveness, we hope ConsNet can serve as a new fundamental network architecture for offline RL.
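To make the core idea concrete, below is a minimal, hypothetical sketch of a value network that is conservative on OOD inputs by construction. It is not the paper's actual ConsNet architecture: the distance-to-data penalty, the class name `ConservativeQNet`, and the `anchors`/`penalty_scale` parameters are illustrative assumptions meant only to convey the notion of pushing values down far from the dataset.

```python
# Hypothetical illustration only; not the ConsNet architecture from the paper.
import torch
import torch.nn as nn


class ConservativeQNet(nn.Module):
    """Toy Q-network: a ReLU MLP minus a penalty that grows with distance to the data."""

    def __init__(self, obs_dim, act_dim, anchors, penalty_scale=1.0, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # Reference (s, a) points drawn from the offline dataset (assumption).
        self.register_buffer("anchors", anchors)
        self.penalty_scale = penalty_scale

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        q = self.mlp(x)
        # Distance to the nearest dataset anchor; large for OOD inputs.
        dist = torch.cdist(x, self.anchors).min(dim=-1, keepdim=True).values
        # Subtracting the distance pushes down values of samples far from the data,
        # so the network is conservative on OOD inputs without extra OOD sampling.
        return q - self.penalty_scale * dist


# Usage sketch with made-up dimensions:
anchors = torch.randn(1000, 17 + 6)                      # dataset (s, a) pairs
qnet = ConservativeQNet(obs_dim=17, act_dim=6, anchors=anchors)
q_values = qnet(torch.randn(32, 17), torch.randn(32, 6))  # shape (32, 1)
```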