Abstract
Apprenticeship learning (AL) is a class of Learning from Demonstration (LfD) techniques in which the reward function of a Markov Decision Process (MDP) is unknown to the learning agent, and the agent must derive a good policy by observing an expert's demonstrations. In this paper, we study how to make AL algorithms inherently safe while still meeting their learning objective. We consider a setting where the unknown reward function is assumed to be a linear combination of a set of state features, and the safety property is specified in Probabilistic Computation Tree Logic (PCTL). By embedding probabilistic model checking inside AL, we propose a novel counterexample-guided approach that ensures safety while retaining the performance of the learned policy. We demonstrate the effectiveness of our approach on several challenging AL scenarios where safety is essential.
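To make the setting concrete, the following is a minimal sketch (our illustration, not code from the paper) of a reward that is linear in state features, alongside a PRISM-style PCTL safety property bounding the probability of ever reaching an unsafe state. The feature map phi, the state fields, and the 0.05 threshold are all hypothetical.

```python
import numpy as np

# Hypothetical feature map: each state is summarized by a fixed-length
# vector phi(s); the unknown reward is assumed to be linear in it.
def phi(state):
    # Illustrative features, e.g. distance to goal and to nearest obstacle.
    return np.array([state["dist_to_goal"], state["dist_to_obstacle"]])

def reward(state, w):
    # R(s) = w . phi(s), where w is the weight vector that inverse
    # reinforcement learning tries to estimate from demonstrations.
    return float(np.dot(w, phi(state)))

# A PRISM-style PCTL safety property: "the probability of eventually
# reaching an unsafe state is at most 5%" (threshold illustrative).
PCTL_SAFETY = 'P<=0.05 [ F "unsafe" ]'
```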
Highlights
– The rapid progress of artificial intelligence (AI) comes with a growing concern over its safety when deployed in real-life systems and situations.
– We consider safety specifications expressed in Probabilistic Computation Tree Logic (PCTL) and show how probabilistic model checking can be used to ensure safety while retaining the performance of a learning algorithm known as apprenticeship learning (AL).
– We develop a novel algorithm called CounterExample Guided Apprenticeship Learning (CEGAL) that combines probabilistic model checking with the optimization-based approach of apprenticeship learning; a high-level sketch of such a loop follows this list.
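The sketch below shows one plausible shape for a counterexample-guided loop of this kind, under our own assumptions rather than the paper's exact formulation. The helpers apprenticeship_step, model_check, and extract_counterexample are hypothetical stand-ins for an AL optimizer, a probabilistic model checker (e.g., a tool such as PRISM), and a counterexample generator.

```python
def cegal(mdp, expert_mu, pctl_spec, max_iters=50):
    """Counterexample-guided apprenticeship learning, high-level sketch.

    apprenticeship_step, model_check, and extract_counterexample are
    hypothetical helpers standing in for an AL optimizer, a PCTL model
    checker, and a counterexample generator, respectively.
    """
    constraints = []  # counterexamples accumulated across iterations
    for _ in range(max_iters):
        # Propose a policy that matches the expert's feature expectations,
        # subject to constraints derived from past counterexamples.
        policy = apprenticeship_step(mdp, expert_mu, constraints)
        # Verify the Markov chain induced by the candidate policy
        # against the PCTL safety specification.
        ok, witness = model_check(mdp, policy, pctl_spec)
        if ok:
            return policy  # safe and (approximately) expert-matching
        # Otherwise refine: rule out the offending behavior next round.
        constraints.append(extract_counterexample(witness))
    raise RuntimeError("no safe policy found within the iteration budget")
```

The design intuition is that each counterexample shrinks the space of candidate policies, so the optimizer cannot repeatedly propose policies that violate the safety property in the same way.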
Summary
The rapid progress of artificial intelligence (AI) comes with a growing concern over its safety when deployed in real-life systems and situations. The concept of AL is closely related to reinforcement learning (RL), in which an agent learns what actions to take in an environment (known as a policy) by maximizing some notion of long-term reward. In AL, the agent is not given the reward function; instead, it must first estimate the reward from a set of expert demonstrations via a technique called inverse reinforcement learning [18]. The expert demonstrates the task by maximizing this reward function, and the agent tries to derive a policy that matches the feature expectations of the expert's demonstrations. One issue with LfD is that the expert can often demonstrate only how the task works, not how it may fail, because failure may cause irrecoverable damage to the system, such as crashing a vehicle. As a result, even if all the demonstrations are safe, the agent may still end up learning an unsafe policy.
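Feature-expectation matching, in the style of classic apprenticeship learning, can be made concrete with a short sketch. Here we estimate the expert's discounted feature expectations empirically from demonstration trajectories; phi is a hypothetical feature map as in the earlier sketch, and trajectories are assumed to be lists of states.

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma=0.99):
    """Estimate mu_E = E[sum_t gamma^t * phi(s_t)] by averaging the
    discounted feature sums over the expert's demonstration trajectories."""
    mu = None
    for traj in trajectories:
        acc = sum((gamma ** t) * phi(s) for t, s in enumerate(traj))
        mu = acc if mu is None else mu + acc
    return mu / len(trajectories)
```

A learned policy is then considered a good match when its own feature expectations are close to this estimate, which (under the linear-reward assumption) bounds the gap in expected reward regardless of the true weight vector.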