Abstract

A promising direction for applying reinforcement learning to the real world is learning from offline datasets. Offline reinforcement learning aims to learn policies from pre-collected datasets without online interaction with the environment. Because no further interaction is possible, offline reinforcement learning suffers from severe extrapolation error, which can cause policy learning to fail. In this paper, we investigate the weighted Bellman update in model-based offline reinforcement learning. We explore uncertainty estimation with ensemble dynamics models, fit a behavioral prior with a variational autoencoder, and propose an algorithm called Model-Based Offline Weighted Policy Optimization (MOWPO), which uses a combination of model confidence and the behavioral prior as weights to reduce the impact of inaccurate samples on policy optimization. Experimental results show that MOWPO outperforms state-of-the-art algorithms, and that both the model confidence weight and the behavioral prior weight contribute to offline policy optimization.
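To make the weighting idea concrete, the sketch below illustrates one way a weighted Bellman (TD) loss of this kind could be computed. It is not the paper's official implementation: the exponential mapping from ensemble disagreement to a confidence weight, the use of a precomputed behavioral-prior log-density, the mixing coefficient lam, and the function name weighted_bellman_loss are all illustrative assumptions.

# Illustrative sketch, not the MOWPO reference implementation. Assumptions:
# - the std of next-state predictions across an ensemble proxies model confidence;
# - the behavioral prior (e.g., from a VAE) is summarized by a per-sample log-density;
# - the two weights are mixed linearly and multiply the per-sample TD error.
import torch

def weighted_bellman_loss(q_values, target_q_values, ensemble_next_states,
                          behavior_log_prob, beta=1.0, lam=0.5):
    """Compute a weighted Bellman (TD) loss.

    q_values:             (B,) current Q(s, a) estimates
    target_q_values:      (B,) bootstrapped targets r + gamma * Q'(s', a')
    ensemble_next_states: (K, B, D) next-state predictions from K dynamics models
    behavior_log_prob:    (B,) log-density of (s, a) under the fitted behavioral prior
    beta, lam:            hypothetical temperature and mixing coefficients
    """
    # Model confidence: larger ensemble disagreement -> lower confidence weight.
    disagreement = ensemble_next_states.std(dim=0).mean(dim=-1)   # (B,)
    confidence_w = torch.exp(-beta * disagreement)                # in (0, 1]

    # Behavioral-prior weight: samples more likely under the data get larger weight.
    prior_w = torch.softmax(behavior_log_prob, dim=0) * behavior_log_prob.numel()

    # Combine the two weights and normalize so the loss scale stays comparable.
    weights = lam * confidence_w + (1.0 - lam) * prior_w
    weights = weights / (weights.mean() + 1e-8)

    td_error = q_values - target_q_values.detach()
    return (weights.detach() * td_error.pow(2)).mean()

# Toy usage with random tensors standing in for a real batch.
if __name__ == "__main__":
    B, K, D = 32, 5, 11
    loss = weighted_bellman_loss(
        q_values=torch.randn(B),
        target_q_values=torch.randn(B),
        ensemble_next_states=torch.randn(K, B, D),
        behavior_log_prob=torch.randn(B),
    )
    print(loss.item())

In this sketch, detaching the weights keeps them from receiving gradients, so they only rescale each sample's contribution to the critic update rather than becoming optimization targets themselves.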
