Abstract

Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey hard constraints during exploration. This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs. Our approach draws on recent advances in object detection and automated reasoning for hybrid dynamical systems. The approach is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of hard constraints. Our benchmark draws from several proposed problem sets for safe learning and includes problems that emphasize challenges such as reward signals that are not aligned with safety constraints. On each of these benchmark problems, our algorithm completely avoids unsafe behavior while remaining competitive at optimizing for as much reward as is safe. We characterize safety constraints in terms of a refinement relation on Markov decision processes: rather than directly constraining the reinforcement learning algorithm so that it only takes safe actions, we instead refine the environment so that only safe actions are defined in the environment's transition structure. This has pragmatic system design benefits and, more importantly, provides a clean conceptual setting in which we are able to prove important safety and efficiency properties. These properties allow us to transform the constrained optimization problem of acting safely in the original environment into an unconstrained optimization in a refined environment.
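
The refinement idea can be pictured as an environment wrapper that only ever executes safe actions. The following is a minimal sketch, not the paper's implementation: it assumes a gym-style interface, and `is_safe` and `safe_fallback` are hypothetical stand-ins for a formally verified safety monitor and a verified fallback controller.

```python
# Sketch: refine an environment so unsafe actions are never executed.
# `is_safe(state, action) -> bool` and `safe_fallback(state) -> action`
# are hypothetical placeholders for verified components.

class SafetyRefinedEnv:
    def __init__(self, env, is_safe, safe_fallback):
        self.env = env                      # original (unrefined) environment
        self.is_safe = is_safe              # safety monitor
        self.safe_fallback = safe_fallback  # known-safe action for any state
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        # Refinement: unsafe actions are not part of the transition structure,
        # so whatever the learner proposes, only a safe action is executed.
        if not self.is_safe(self.state, action):
            action = self.safe_fallback(self.state)
        self.state, reward, done, info = self.env.step(action)
        return self.state, reward, done, info
```

Under these assumptions, any off-the-shelf RL algorithm can be trained directly on the refined environment, which mirrors the reduction from constrained to unconstrained optimization described above.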

Highlights

  • Deep reinforcement learning algorithms [46] are effective at learning, often from raw sensor inputs, control policies that optimize for a quantitative reward signal

  • Our experiments demonstrate that VSRL is capable of optimizing for a reward structure related to objects whose positions we do not extract via supervised training

  • We evaluate VSRL on four environments: a discrete XO environment [13], an adaptive cruise control environment (ACC), a 2D goal-finding environment (GF) similar to the OpenAI Safety Gym Goal environment [43] but without a MuJoCo dependency and with simpler dynamics, and a pointmesses environment (PM) that emphasizes the problem of preventing reward hacking in safe exploration systems

Introduction

Deep reinforcement learning algorithms [46] are effective at learning, often from raw sensor inputs, control policies that optimize for a quantitative reward signal. Learning these policies can require experiencing millions of unsafe actions. One effective implementation and extension of this idea is proximal policy optimization (PPO), which improves sample efficiency and stability by sampling data in batches and optimizing a surrogate objective function that prevents overly large policy updates [45]. This enables end-to-end learning through gradient descent, which significantly reduces the dependency of the learning task on refined domain knowledge. To ensure formal guarantees we use differential dynamic logic (dL) [36, 37, 38, 40], a logic for specifying and proving reachability properties of hybrid dynamical systems, which combine both discrete and continuous dynamics.
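
As a rough illustration of the surrogate objective mentioned above, the sketch below computes PPO's clipped policy loss in PyTorch; the tensor names and the clipping coefficient are illustrative defaults, not values taken from the paper.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO, returned as a loss to minimize.

    The probability ratio r = pi_new(a|s) / pi_old(a|s) is clipped to
    [1 - clip_eps, 1 + clip_eps], which discourages overly large policy updates.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms; negate to obtain a loss.
    return -torch.min(unclipped, clipped).mean()
```

In dL, safety properties of such hybrid systems typically take the form init -> [{ctrl; plant}*] safe, stating that every run of the repeated controller-plant loop, where plant is given by differential equations, remains within the safe region.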
