Abstract
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a data set collected a priori. Without further interactions with the environment, offline RL suffers from insufficient coverage of the data set, which eludes most existing theoretical analyses. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function used to promote exploration in online RL, which makes it easy to implement and compatible with general function approximators. Without assuming sufficient coverage of the data set (e.g., finite concentrability coefficients or uniformly lower-bounded densities of visitation measures), we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the data set, the learned policy serves as the “best effort” among all policies, as no other policy can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which arises from the “irrelevant” trajectories that are less covered by the data set and not informative for the optimal policy.

Funding: Z. Yang acknowledges the Simons Institute (Theory of Reinforcement Learning). Z. Wang acknowledges the National Science Foundation [Awards 2048075, 2008827, 2015568, and 1934931], the Simons Institute (Theory of Reinforcement Learning), Amazon, J. P. Morgan, and Two Sigma.
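To make the construction concrete, below is a minimal sketch of how a pessimistic value iteration of this kind could be instantiated for a linear MDP with a finite action set. The function names, the feature map `phi`, the data format, and the toy data are illustrative assumptions, not the paper's reference implementation; the key point is that the elliptical uncertainty quantifier is subtracted from the estimated Q-value rather than added, i.e., the sign of the online-RL exploration bonus is flipped.

```python
# Minimal sketch of pessimistic value iteration (PEVI) for a linear MDP.
# Assumptions (not from the abstract): a finite action set, a known feature
# map phi(s, a) in R^d, and an offline dataset of K length-H trajectories.
import numpy as np

def pevi(dataset, phi, actions, H, d, beta, lam=1.0):
    """dataset: list of trajectories, each a list of (state, action, reward, next_state)."""
    # Q_h is represented by its weight vector w_h and covariance Lambda_h.
    weights, covs = [None] * (H + 2), [None] * (H + 2)

    def q_value(h, s, a):
        if weights[h] is None:          # beyond the horizon
            return 0.0
        x = phi(s, a)
        # Uncertainty quantifier used as a penalty (bonus with flipped sign).
        penalty = beta * np.sqrt(x @ np.linalg.solve(covs[h], x))
        # Pessimistic, truncated Q estimate: clip to [0, H - h + 1].
        return float(np.clip(x @ weights[h] - penalty, 0.0, H - h + 1))

    def v_value(h, s):
        return max(q_value(h, s, a) for a in actions)

    # Backward induction from step H down to 1.
    for h in range(H, 0, -1):
        Lam = lam * np.eye(d)
        target = np.zeros(d)
        for traj in dataset:
            s, a, r, s_next = traj[h - 1]
            x = phi(s, a)
            Lam += np.outer(x, x)
            target += x * (r + v_value(h + 1, s_next))
        covs[h] = Lam
        weights[h] = np.linalg.solve(Lam, target)   # ridge regression fit of Q_h

    # Greedy policy with respect to the pessimistic Q-functions.
    def policy(h, s):
        return max(actions, key=lambda a: q_value(h, s, a))

    return policy

# Toy usage with random data: scalar states, two actions, hand-crafted features.
rng = np.random.default_rng(0)
H, d, actions = 3, 4, [0, 1]
phi = lambda s, a: np.array([1.0, s, a, s * a])
dataset = [[(rng.random(), int(rng.integers(2)), rng.random(), rng.random())
            for _ in range(H)] for _ in range(50)]
pi = pevi(dataset, phi, actions, H, d, beta=1.0)
print(pi(1, 0.5))
```

The penalty term shrinks the value estimate at state-action pairs that are poorly covered by the data, so the greedy policy avoids actions whose estimated values rest on little evidence.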