Abstract

We consider the task of Inverse Reinforcement Learning (IRL) in Contextual Markov Decision Processes (CMDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. Although the reward is a function of the context, it is not provided to the agent; instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping so that the agent acts optimally even when encountering previously unseen contexts, a capability known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods, comparing their sample complexity and scalability both theoretically and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer, i.e., generalize to new and unseen contexts. Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians, based on recorded data of them treating patients diagnosed with sepsis.
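To make the optimization scheme concrete, below is a minimal sketch of one projected-subgradient step, assuming a linear contextual reward r_c(s) = c^T W phi(s). The helpers solve_mdp and feature_expectations are hypothetical placeholders (a planner and an estimator of discounted feature expectations), not the paper's actual implementation.

```python
import numpy as np

def coirl_subgradient_step(W, context, expert_mu, solve_mdp,
                           feature_expectations, lr=0.1):
    """One projected-subgradient step on the convex COIRL loss (sketch).

    Assumes a linear contextual reward r_c(s) = context^T W phi(s).
    The per-context loss  max_pi c^T W mu(pi) - c^T W mu(pi*_c)  is a
    pointwise maximum of linear functions of W, hence convex but
    non-differentiable; c (mu(pi_hat) - mu(pi*_c))^T is a valid
    subgradient, where pi_hat is optimal for the current reward estimate.

    solve_mdp / feature_expectations are assumed helpers: a planner and
    an estimator of discounted feature expectations, respectively.
    """
    reward_weights = context @ W           # reward induced by this context
    pi_hat = solve_mdp(reward_weights)     # best response to current estimate
    mu_hat = feature_expectations(pi_hat)
    subgrad = np.outer(context, mu_hat - expert_mu)
    W = W - lr * subgrad
    return W / max(1.0, np.linalg.norm(W))  # project back onto the unit ball
```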

Highlights

  • Real-world sequential decision-making problems often share three important properties: (1) the reward function is unknown, yet (2) expert demonstrations can be acquired, and (3) the reward and/or dynamics depend on a static parameter, known as the context

  • While the grid-world experiments focused on comparing Contextual Inverse Reinforcement Learning (COIRL) with standard IRL, in the autonomous driving simulator we compare the various methods for performing COIRL (Fig. 5)

  • The mapping from contexts to the true reward is constructed so that it induces different behaviors for different contexts, making generalization a challenging task (a toy illustration follows this list)
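As a toy illustration of the last point, the sketch below uses a hand-picked mapping W (an assumption for illustration, not the mapping used in the paper's experiments) under which two contexts prefer opposite actions:

```python
import numpy as np

# Hand-picked mapping from context to feature weights (illustrative only).
W = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])

# Feature vectors of two candidate actions (e.g., two treatments).
phi = {"action_a": np.array([1.0, 0.0]),
       "action_b": np.array([0.0, 1.0])}

for context in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    reward_weights = context @ W                      # r_c(.) = c^T W phi(.)
    best = max(phi, key=lambda a: reward_weights @ phi[a])
    print(f"context={context} -> prefers {best}")
# context=[1. 0.] -> prefers action_a
# context=[0. 1.] -> prefers action_b
```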

Introduction

Real-world sequential decision-making problems often share three important properties: (1) the reward function is unknown, yet (2) expert demonstrations can be acquired, and (3) the reward and/or dynamics often depend on a static parameter, known as the context. Consider a dynamic treatment regime (Chakraborty & Murphy, 2014), where a clinician acts to improve a patient’s medical condition. While the patient’s dynamic measurements, e.g., heart rate and blood pressure, define the state, static parameters, e.g., age and weight, determine how the patient reacts to certain treatments and what form of treatment is optimal. As opposed to traditional medicine, which provides a treatment for the “average patient”, in the contextual model patients are separated into groups for which the medical decisions are tailored (Fig. 1). This enables the decision maker to provide more effective, individually tailored decisions (e.g., treatments) based on these static parameters.
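The contextual structure described above can be summarized in code. Below is a minimal, hypothetical sketch (names and signatures are assumptions, not the paper's implementation) in which the static context selects both the reward and the transition kernel, yielding an ordinary MDP per patient:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class ContextualMDP:
    """Sketch of a contextual MDP: the static context parameterizes
    both the reward and the transition kernel (names are illustrative)."""
    n_states: int
    n_actions: int
    reward_fn: Callable[[np.ndarray], np.ndarray]      # context -> R[s, a]
    transition_fn: Callable[[np.ndarray], np.ndarray]  # context -> P[s, a, s']

    def instantiate(self, context: np.ndarray):
        """Fix a context (e.g., a patient's age and weight) to obtain
        an ordinary MDP that a planner or IRL method can operate on."""
        return self.reward_fn(context), self.transition_fn(context)
```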
