Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes

Florent Delgrange,Ann Nowé,Guillermo A Pérez

doi:10.1609/aaai.v36i6.20602

Abstract

We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep-RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.

Full Text

Published version (

Free)

Open DOI Link

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes

Abstract

Talk to us

Similar Papers

More From: Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence

Lead the way for us

Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence	Publication Date: Jun 28, 2022
Citations: 1

Similar Papers

Reinforcement learning with algorithms from probabilistic structure estimation
Jonathan P Epperlein ... Robert Shorten
Automatica | VOL. 144
Jonathan P Epperlein, et. al.Jonathan P Epperlein ... Robert Shorten
06 Aug 2022
Automatica | VOL. 144

Combining manual feedback with subsequent MDP reward signals for reinforcement learning
...
-
, et. al. ...
10 May 2010
10 May 2010

Complex-Valued Reinforcement Learning: a Context-Based Approach for POMDPs
Takeshi Shibuya ... Tomoki Hamagami
-
Takeshi Shibuya, et. al.Takeshi Shibuya ... Tomoki Hamagami
14 Jan 2011
14 Jan 2011

Reinforcement learning from simultaneous human and MDP reward
...
-
, et. al. ...
04 Jun 2012
04 Jun 2012

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes

Abstract

Talk to us

Similar Papers

More From: Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence