Abstract

The ability of an agent to detect changes in an environment is key to successful adaptation. This ability involves at least two phases: learning a model of an environment, and detecting that a change is likely to have occurred when this model is no longer accurate. This task is particularly challenging in partially observable environments, such as those modeled with partially observable Markov decision processes (POMDPs). Some predictive learners are able to infer the state from observations and thus perform better under partial observability. Predictive state representations (PSRs) and neural networks are two such tools that can be trained to predict the probabilities of future observations. However, most existing methods of this kind focus primarily on static problems in which only one environment is learned. In this paper, we propose an algorithm that uses statistical tests to estimate the probability that different predictive models fit the current environment. We exploit the underlying probability distributions of predictive models to provide a fast and explainable method for assessing and justifying the model's beliefs about the current environment. Crucially, by doing so, the method can label incoming data as fitting different models, and can thus continuously train separate models in different environments. This new method is shown to prevent catastrophic forgetting when new environments, or tasks, are encountered. The method can also be useful when AI-informed decisions require justification, because its beliefs are based on statistical evidence from observations. We empirically demonstrate the benefit of the novel method with simulations in a set of POMDP environments.
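
To make the detection idea concrete, the sketch below is a minimal illustration, not the paper's algorithm: it assumes each candidate predictive model can be reduced to a predicted observation distribution, and it uses a chi-square goodness-of-fit test over a sliding window of recent observations to decide which model, if any, the incoming data fits. The interface names, window size, and significance level are assumptions introduced here for illustration.

```python
import numpy as np
from scipy.stats import chisquare

ALPHA = 0.01    # significance level below which a model is judged not to fit
WINDOW = 200    # number of recent observations used by the test

def fit_pvalue(predicted_dist, recent_obs, n_obs):
    """p-value of a chi-square goodness-of-fit test comparing a model's
    predicted observation distribution with the observations actually seen
    in the recent window. predicted_dist must be strictly positive
    (e.g. Laplace-smoothed)."""
    observed = np.bincount(recent_obs, minlength=n_obs)
    expected = predicted_dist / predicted_dist.sum() * len(recent_obs)
    return chisquare(f_obs=observed, f_exp=expected).pvalue

def assign_window(model_dists, recent_obs, n_obs):
    """Label a window of data with the candidate model it most plausibly
    came from, or return None to signal that a new model is needed."""
    pvalues = [fit_pvalue(d, recent_obs, n_obs) for d in model_dists]
    best = int(np.argmax(pvalues))
    if pvalues[best] < ALPHA:      # statistical evidence that no model fits
        return None, pvalues       # caller creates and trains a fresh model
    return best, pvalues           # caller keeps training only model `best`
```

Here each candidate model is collapsed to a single observation distribution for simplicity; with history-conditioned models such as PSRs or recurrent networks, the same kind of test can instead be run on the per-step probabilities the model assigned to the observations actually received. Updating only the model that the data is labeled with is what keeps one environment's training from overwriting another's.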

Highlights

  • A useful skill for an agent that explores the world and learns to act in it is the ability to predict what happens (Geisser, 1993)

  • A partially observable Markov decision process (POMDP) is defined as a hextuple {S, A, T, R, O, Ω}, where S is the set of underlying Markov decision process (MDP) states; A is the set of actions; T is the transition function, T : A × S × S → [0, 1], which gives the probability of transitioning from one state to another given the action taken; R is the reward function; O is the set of observations; and Ω is the set of conditional observation probabilities (a minimal data-structure sketch follows this list)

  • The proposed algorithm was tested on a set of POMDPs of varying size and complexity
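
As a concrete reference for the POMDP definition above, here is a minimal data-structure sketch of the hextuple. The dictionary encodings and type names are illustrative assumptions, not anything prescribed by the paper.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

State, Action, Obs = int, int, int   # illustrative element types

@dataclass
class POMDP:
    """Minimal container mirroring the hextuple {S, A, T, R, O, Omega}."""
    S: Set[State]                                   # underlying MDP states
    A: Set[Action]                                  # actions
    T: Dict[Tuple[Action, State, State], float]    # T(a, s, s') = P(s' | s, a)
    R: Dict[Tuple[State, Action], float]           # expected immediate reward
    O: Set[Obs]                                     # observations
    Omega: Dict[Tuple[Action, State, Obs], float]  # P(o | a, s')
```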



Introduction

A useful skill for an agent that explores the world and learns to act in it is the ability to predict what happens (Geisser, 1993). One way is to learn a model of the world, so that predictions are generated within the agent and compared with observations to improve the model. An agent may interact in different environments at different points in time, a condition that challenges learning algorithms, which often need to be set manually by a user who labels tasks or substitutes models for each new task or environment. Predictive models such as predictive state representations (Littman et al., 2002), neural networks, and POMDPs have been extensively used in the past to model dynamical systems with discrete representations. Let the history h_i ∈ H of the agent at time i be the stream of action-observation pairs a_j ∈ A, o_j ∈ O, ∀ j ∈ [0, i) ∩ ℕ, observed up to time i.
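
The history definition above maps directly onto a simple data structure. The sketch below assumes a hypothetical model.predict(history, action) interface returning a distribution over observations; it shows how h_i grows one action-observation pair at a time and how a predictive model would score a future action-observation sequence (a "test" in PSR terminology).

```python
from typing import List, Tuple

Action, Obs = int, int
History = Tuple[Tuple[Action, Obs], ...]   # h_i = (a_0, o_0), ..., (a_{i-1}, o_{i-1})

def extend(history: History, action: Action, observation: Obs) -> History:
    """Produce h_{i+1} from h_i by appending one action-observation pair."""
    return history + ((action, observation),)

def sequence_probability(model, history: History,
                         future: List[Tuple[Action, Obs]]) -> float:
    """Probability a predictive model assigns to a future action-observation
    sequence given the history so far. model.predict(history, action) is a
    hypothetical interface returning P(o | history, a) for every o."""
    p, h = 1.0, history
    for a, o in future:
        p *= model.predict(h, a)[o]
        h = extend(h, a, o)
    return p
```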

