Abstract

This paper presents a new application of information geometry to reinforcement learning, focusing on dynamic treatment regimes. In the standard reinforcement learning framework, a Q-function is defined as the conditional expectation of a reward given a state and an action in a single-stage setting. We introduce an equivalence relation, called policy equivalence, on the space of all Q-functions, and define a class of information divergences on the Q-function space for every stage. The main objective is to propose an estimator of the optimal policy function by a minimum information divergence method based on a dataset of trajectories. In particular, we discuss the \(\gamma \)-power divergence, which is shown to have the advantageous property that it vanishes between policy-equivalent Q-functions. This property is essential for seeking the optimal policy and is discussed within a semiparametric model for the Q-function. Specific choices of the power index \(\gamma \) yield interesting relationships between the value function and the geometric and harmonic means of the Q-function. A numerical experiment demonstrates the performance of the minimum \(\gamma \)-power divergence method in the context of dynamic treatment regimes.
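To illustrate the key property claimed above, the following is a minimal sketch assuming the \(\gamma \)-power divergence takes the projective form of the Fujisawa–Eguchi \(\gamma \)-divergence, evaluated over a finite action set at a fixed state; the paper's exact definition on the Q-function space may differ, and the function names here are hypothetical. Under this form, rescaling a Q-function by a positive constant leaves the greedy (argmax) policy unchanged and drives the divergence to zero.

```python
import numpy as np

def gamma_power_divergence(p, q, gamma=0.5):
    """Projective gamma-power divergence between two positive vectors
    (here: Q-values over a finite action set at a fixed state).
    Illustrative form only; the paper's definition may differ."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    term_p = np.log(np.sum(p ** (gamma + 1))) / (gamma * (gamma + 1))
    cross  = np.log(np.sum(p * q ** gamma)) / gamma
    term_q = np.log(np.sum(q ** (gamma + 1))) / (gamma + 1)
    return term_p - cross + term_q

# A positive rescaling of the Q-values preserves the greedy policy,
# and the divergence to the rescaled Q-function is (numerically) zero.
q_values = np.array([1.0, 2.5, 0.7])       # Q(s, a) for three actions
rescaled = 3.2 * q_values                   # policy-equivalent Q-function
print(gamma_power_divergence(q_values, rescaled))                   # ~0
print(gamma_power_divergence(q_values, np.array([2.0, 1.0, 0.5])))  # > 0
```

In this form the divergence is nonnegative by Hölder's inequality and vanishes exactly when the two Q-functions are proportional, which is one natural way the policy-equivalence property described in the abstract can arise.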
