A criterion for selecting the appropriate one from the trained models for model‐based offline policy evaluation

Chongchong Li, Zhi‐Ming Ma, Yuting Liu, Yue Wang

doi:10.1049/cit2.12376

Abstract

Offline policy evaluation, which evaluates and selects complex decision‐making policies using only offline datasets, is important in reinforcement learning. At present, model‐based offline policy evaluation (MBOPE) is widely used because it is easy to implement and performs well. MBOPE directly approximates the unknown value of a given policy by the Monte Carlo method, using estimated transition and reward functions of the environment. Usually, multiple models are trained and one of them is then selected for use. However, selecting an appropriate model from those trained remains a challenge. The authors first analyse the upper bound of the difference between the approximated value and the unknown true value. The theoretical results show that this difference is related to the trajectories generated by the given policy on the learnt model and to the prediction error of the transition and reward functions at these generated data points. Based on these results, a new criterion is proposed to indicate which trained model is better suited for evaluating the given policy. Finally, the effectiveness of the proposed criterion is demonstrated on both benchmark and synthetic offline datasets.
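To make the Monte Carlo step concrete, the following is a minimal sketch of how MBOPE could estimate a policy's value by rolling the policy out on a learnt model. It is an illustration under stated assumptions, not the authors' implementation: the function name `mc_policy_value`, its parameters (`horizon`, `gamma`, `n_rollouts`), and the toy one-dimensional transition, reward, and policy functions are all hypothetical. The abstract does not specify the proposed selection criterion, so it is not reproduced here.

```python
import random


def mc_policy_value(policy, transition_model, reward_model, initial_states,
                    horizon=50, gamma=0.99, n_rollouts=100, seed=0):
    """Monte Carlo estimate of a policy's discounted return on a learnt model.

    All arguments are hypothetical stand-ins: `transition_model` and
    `reward_model` play the role of the estimated environment functions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        state = rng.choice(initial_states)
        discount, ret = 1.0, 0.0
        for _ in range(horizon):
            action = policy(state)
            # Accumulate the predicted reward, then step the learnt model.
            ret += discount * reward_model(state, action)
            state = transition_model(state, action, rng)
            discount *= gamma
        total += ret
    # Average the discounted returns over all simulated trajectories.
    return total / n_rollouts


# Toy stand-ins (assumptions for illustration only, 1-D state and action).
def toy_transition(state, action, rng):
    return 0.9 * state + action + rng.gauss(0.0, 0.01)


def toy_reward(state, action):
    return -state * state


def toy_policy(state):
    return -0.5 * state


estimate = mc_policy_value(toy_policy, toy_transition, toy_reward,
                           initial_states=[1.0, -1.0])
```

With several trained models, the same routine could be run once per model. The estimates will generally differ, which is why a criterion for choosing among the models is needed.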


Journal: CAAI Transactions on Intelligence Technology
Publication Date: Oct 9, 2024
License type: CC BY-NC-ND 4.0

