Abstract

Off-policy evaluation is an important topic in reinforcement learning: it estimates the expected cumulative reward of a target policy from logged trajectory data generated by a different behavior policy, without executing the target policy. It is imperative to quantify the uncertainty of the off-policy estimate before the target policy is deployed. Here we leverage methodologies from (Wasserstein) distributionally robust optimization to provide robust and optimistic cumulative reward estimates. With proper selection of the size of the distributional uncertainty set, these estimates serve as confidence bounds with non-asymptotic and asymptotic guarantees under stochastic or adversarial environments. We also generalize these results to batch reinforcement learning.
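To make the off-policy evaluation setup concrete, the sketch below shows a standard trajectory-wise importance-sampling estimate of a target policy's expected cumulative reward from behavior-policy logs. It illustrates only the basic estimation problem the abstract refers to, not the paper's distributionally robust bounds; the function and argument names (`is_ope_estimate`, `pi_target`, `pi_behavior`) are hypothetical placeholders.

```python
import numpy as np

def is_ope_estimate(trajectories, pi_target, pi_behavior, gamma=1.0):
    """Importance-sampling estimate of the target policy's expected return.

    trajectories: iterable of trajectories, each a list of (state, action, reward).
    pi_target, pi_behavior: functions mapping (state, action) to the probability
        of taking `action` in `state` under the respective policy.
    Names and interfaces here are illustrative assumptions, not from the paper.
    """
    returns = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in traj:
            # Cumulative importance weight corrects for the mismatch between
            # the behavior policy that generated the data and the target policy.
            weight *= pi_target(state, action) / pi_behavior(state, action)
            ret += discount * reward
            discount *= gamma
        returns.append(weight * ret)
    # Point estimate and its standard error; the paper instead derives
    # confidence bounds via distributionally robust optimization.
    return np.mean(returns), np.std(returns) / np.sqrt(len(returns))
```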
