Partial and Conditional Expectations in Markov Decision Processes with Integer Weights

Jakob Piribauer, Christel Baier

doi:10.1007/978-3-030-17127-8_25

Abstract

The paper addresses two variants of the stochastic shortest path problem (“optimize the accumulated weight until reaching a goal state”) in Markov decision processes (MDPs) with integer weights. The first variant optimizes partial expected accumulated weights, where paths not leading to a goal state are assigned weight 0, while the second variant considers conditional expected accumulated weights, where the probability mass is redistributed to paths reaching the goal. Both variants constitute useful approaches to the analysis of systems without guarantees on the occurrence of an event of interest (reaching a goal state), but have only been studied in structures with non-negative weights. Our main results are as follows. There are polynomial-time algorithms to check the finiteness of the supremum of the partial or conditional expectations in MDPs with arbitrary integer weights. If finite, then optimal weight-based deterministic schedulers exist. In contrast to the setting of non-negative weights, optimal schedulers can need infinite memory and their value can be irrational. However, the optimal value can be approximated up to an absolute error of \(\epsilon\) in time exponential in the size of the MDP and polynomial in \(\log(1/\epsilon)\).

Highlights

Stochastic shortest path (SSP) problems generalize the shortest path problem on graphs with weighted edges.
Optimal values are achieved by weight-based deterministic schedulers that depend on the current state and the weight that has been accumulated so far, while memoryless schedulers are not sufficient.
The optimal values can be irrational, showing that the linear programming approaches from the setting of non-negative weights cannot be applied for the computation of optimal values.

Summary

Introduction

Stochastic shortest path (SSP) problems generalize the shortest path problem on graphs with weighted edges. Conditional expectations in MDPs with non-negative weights have been addressed in [3]. In both cases, optimal values are achieved by weight-based deterministic schedulers that depend on the current state and the weight that has been accumulated so far, while memoryless schedulers are not sufficient.

If we add a new initial state making sure that the goal is reached with positive probability, as in the MDP N, we can obtain an irrational maximal conditional expectation as well: the scheduler \(T_k\) choosing τ in c as soon as the weight reaches \(k\) has conditional expectation \(\frac{k/(2\Phi^{k})}{1/2 + 1/(2\Phi^{k})}\). As we will see later, this implies that the existence of saturation points is no longer ensured and optimal schedulers might require infinite memory (more precisely, a counter for the accumulated weight). These observations provide evidence why linear-programming techniques as used in the case of non-negative MDPs [3,8] cannot be expected to be applicable in the general setting.

The recent work on notions of conditional value at risk in MDPs [15] studies conditional expectations, but the considered random variables are limit averages and a notion of (non-accumulated) weight-bounded reachability.
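
The difference between the two objectives can be illustrated on a small example. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical Markov chain whose behaviour has already been unfolded into three finite paths, and the names `PATHS`, `partial_expectation`, `goal_probability` and `conditional_expectation` are illustrative only. The partial expectation assigns weight 0 to paths that miss the goal, while the conditional expectation divides the same sum by the probability of reaching the goal.

```python
from fractions import Fraction

# Hypothetical toy data (an assumption, not from the paper): each entry is
# one path, given as (probability, accumulated weight, reaches_goal).
PATHS = [
    (Fraction(1, 2), 4, True),    # reaches the goal with weight 4
    (Fraction(1, 4), -2, True),   # reaches the goal with negative weight -2
    (Fraction(1, 4), 7, False),   # never reaches the goal
]


def partial_expectation(paths):
    """Expected accumulated weight, counting paths that miss the goal as 0."""
    return sum(p * w for p, w, reaches_goal in paths if reaches_goal)


def goal_probability(paths):
    """Total probability of the paths that reach the goal."""
    return sum(p for p, _, reaches_goal in paths if reaches_goal)


def conditional_expectation(paths):
    """Expected accumulated weight under the condition that the goal is reached."""
    pr = goal_probability(paths)
    if pr == 0:
        raise ValueError("conditional expectation is undefined: goal unreachable")
    return partial_expectation(paths) / pr


if __name__ == "__main__":
    print("partial expectation:    ", partial_expectation(PATHS))
    print("goal probability:       ", goal_probability(PATHS))
    print("conditional expectation:", conditional_expectation(PATHS))
```

In this toy instance the weight 7 on the path that misses the goal influences neither value. In an MDP, a scheduler would additionally choose among such path distributions, which this sketch does not model.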

Preliminaries

Partial and Conditional Expectations in MDPs

Existence of Optimal Schedulers

Approximation

Conclusion