Abstract

Offline reinforcement learning (RL) methods learn from fixed datasets without further environment interaction and therefore suffer from errors caused by out-of-distribution (OOD) actions. Although effective methods have been proposed that conservatively estimate the Q-values of OOD actions to mitigate this problem, insufficient or excessive pessimism under a constant constraint often harms policy learning. Moreover, since the data distribution of each task varies across environments and behavior policies, it is desirable to learn, for each task, an adaptive weight that balances the conservative Q-value constraint against the standard RL objective. To this end, we point out that a quantile of the Q-value distribution of the fixed dataset is an effective reference for setting the degree of pessimism. Based on this observation, we design the Adaptive Pessimism via a Target Q-value (APTQ) algorithm, which balances the pessimism constraint and the RL objective so that the expected Q-value stably converges to a target Q-value chosen from a reasonable quantile of the dataset's Q-value distribution. Experiments show that our method improves the performance of the state-of-the-art method CQL by 6.20% on D4RL-v0 and 1.89% on D4RL-v2.
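To make the adaptive-weight idea above concrete, the sketch below shows one plausible way to drive a CQL-style pessimism weight toward a target Q-value taken from a quantile of the dataset's Q-value distribution. This is a minimal illustration under our own assumptions, not the authors' APTQ implementation; all names, the Lagrangian-style update, and the hyperparameters are hypothetical.

```python
import numpy as np
import torch

def target_q_from_dataset(q_values_on_dataset: np.ndarray, quantile: float = 0.9) -> float:
    """Choose the target Q-value as a quantile of Q-values evaluated on dataset actions
    (quantile level 0.9 is an illustrative assumption)."""
    return float(np.quantile(q_values_on_dataset, quantile))

class AdaptivePessimismWeight:
    """Lagrangian-style update of a pessimism weight alpha (assumed mechanism, not APTQ itself):
    alpha grows when the learned Q exceeds the target and shrinks when it falls below."""

    def __init__(self, target_q: float, lr: float = 1e-3):
        self.target_q = target_q
        # Parameterize log(alpha) so that alpha stays positive.
        self.log_alpha = torch.zeros(1, requires_grad=True)
        self.optimizer = torch.optim.Adam([self.log_alpha], lr=lr)

    @property
    def alpha(self) -> torch.Tensor:
        return self.log_alpha.exp()

    def update(self, current_q_mean: torch.Tensor) -> float:
        # Gradient descent on -alpha * (E[Q] - target) increases alpha when E[Q] > target
        # (more pessimism) and decreases it when E[Q] < target (less pessimism).
        loss = -self.alpha * (current_q_mean.detach() - self.target_q)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return float(self.alpha.item())
```

In a CQL-like training loop, `alpha` would scale the conservative penalty on OOD actions while the usual Bellman loss supplies the RL objective, so the per-task balance between the two terms is learned rather than fixed.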
