Demystifying the Curse of Horizon in Offline Reinforcement Learning in Order to Break It

Abstract

Offline reinforcement learning (RL), where we evaluate and learn new policies using existing off-policy data, is crucial in applications where experimentation is challenging and simulation unreliable, such as medicine. It is also notoriously difficult because the similarity (density ratio) between observed trajectories and those generated by any new policy diminishes exponentially as the horizon grows, a phenomenon known as the curse of horizon, which severely limits the application of offline RL whenever horizons are moderately long or even infinite. In "Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning," Kallus and Uehara set out to understand these limits and when they can be broken. They precisely characterize the curse by deriving the semiparametric efficiency lower bounds for the policy-value estimation problem in different models. On the one hand, this shows why the curse necessarily plagues standard estimators: they remain valid even in non-Markov models and are therefore subject to the corresponding lower bound. On the other hand, greater efficiency is possible in certain Markovian models, and the authors give the first estimator achieving these much lower efficiency bounds in infinite-horizon Markov decision processes.
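To make the curse of horizon concrete, below is a minimal, hypothetical Python sketch (not taken from the paper). It simulates the cumulative importance weights that standard trajectory-wise off-policy estimators rely on, using an illustrative two-action behavior/target policy pair, and shows how the variance of those weights explodes as the horizon grows. The policy probabilities, horizon grid, and sample size are all assumptions made purely for illustration.

```python
import numpy as np

# Illustrative two-action example: the behavior policy pi_b and the target
# policy pi_e disagree mildly at every step, so each per-step density ratio
# pi_e(a_t) / pi_b(a_t) differs from 1.
rng = np.random.default_rng(0)
pi_b = np.array([0.5, 0.5])   # behavior policy action probabilities (assumed)
pi_e = np.array([0.7, 0.3])   # target policy action probabilities (assumed)

def cumulative_weights(horizon, n_traj=10_000):
    """Trajectory-level importance weights: prod_t pi_e(a_t) / pi_b(a_t)."""
    actions = rng.choice(2, size=(n_traj, horizon), p=pi_b)
    step_ratios = pi_e[actions] / pi_b[actions]
    return step_ratios.prod(axis=1)

for horizon in (1, 10, 50, 100):
    w = cumulative_weights(horizon)
    # The mean stays near 1 (the weights are unbiased), but the variance --
    # and hence the error of any estimator weighted by them -- grows
    # exponentially in the horizon.
    print(f"H={horizon:>3}  mean={w.mean():.3f}  variance={w.var():.3e}")
```

Roughly speaking, in the Markovian models the paper studies, one can instead weight by marginal state-action density ratios, which do not compound across time steps; this is the kind of structure that makes the much lower Markov efficiency bound attainable.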
