Abstract

In this paper, we consider distributionally robust Markov Decision Processes with Borel state and action spaces and infinite time horizon. The problem is formulated as a Stackelberg game in which nature, as a second player, chooses the least favorable disturbance density in each scenario. Under suitable assumptions, we prove that the value function is the unique fixed point of an operator and that minimizers and maximizers, respectively, lead to optimal policies for the decision maker and nature. Based on this result, we introduce a Q-learning approach to solve the problem via simulation-based techniques. We prove the convergence of the Q-learning algorithm and study its performance using a distributionally robust irrigation problem.
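To make the max-min structure described above concrete, the following is a minimal toy sketch of a distributionally robust Q-learning update: the decision maker minimizes cost while nature selects the least favorable disturbance model at each step. Everything here is an illustrative assumption rather than the paper's construction: the paper works on Borel spaces with disturbance densities, whereas this sketch uses finite state and action sets, a finite ambiguity set of `n_models` hypothetical simulators, and a toy `step` function.

```python
import numpy as np

# Hypothetical sizes and hyperparameters (not from the paper).
rng = np.random.default_rng(0)
n_states, n_actions, n_models = 5, 3, 4
gamma, alpha, epsilon = 0.95, 0.1, 0.1


def step(state, action, model):
    """Hypothetical simulator: successor state and cost under
    disturbance model `model` (stand-in for a disturbance density)."""
    next_state = int(rng.integers(n_states))
    cost = float(rng.random()) + 0.1 * model  # toy cost, varies per model
    return next_state, cost


Q = np.zeros((n_states, n_actions))  # robust Q-value estimates

state = 0
for _ in range(20_000):
    # Decision maker: epsilon-greedy minimization over robust Q-values.
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(Q[state].argmin())

    # Nature, as the second player, picks the least favorable scenario:
    # simulate each model once and keep the worst (largest) Bellman target.
    targets = []
    for m in range(n_models):
        s_next, cost = step(state, action, m)
        targets.append(cost + gamma * Q[s_next].min())
    target = max(targets)

    Q[state, action] += alpha * (target - Q[state, action])
    state = s_next  # continue from the last simulated successor
```

Taking the maximum over single sampled backups is a crude, biased stand-in for the least favorable expectation over disturbance densities; it is meant only to illustrate the Stackelberg (max-min) fixed-point structure, not the convergent algorithm analyzed in the paper.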
