Abstract

This paper proposes an optimistic value iteration for steady-state mean–variance optimization in infinite-horizon discounted Markov decision processes (MDPs). The variance metric considered captures long-run reward variability, with future deviations discounted to their present values. This mean–variance optimality criterion is time-inconsistent because its reward function depends on the mean, which renders traditional dynamic programming methods inapplicable. A family of policy/value iteration algorithms can be developed under a bilevel optimization framework, but several problems remain before the approach can be extended to reinforcement learning (RL). One problem is that, in the value iteration, the inner optimization must reach a near-optimal solution before each outer update to guarantee convergence, and this is required in every outer iteration. In an RL scenario, however, it is impractical to determine when the inner optimization should be stopped. To address this problem, we propose an optimistic value iteration in which the outer updates are merged into the inner optimization with a learning rate. We prove the convergence of the algorithm and validate it with a numerical experiment on portfolio management.
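
To make the bilevel structure and the merged ("optimistic") update concrete, the sketch below illustrates the idea on a small tabular MDP. It is a minimal sketch under assumptions of our own: the surrogate reward r - λ(r - y)², the outer target (the average one-step reward of the current greedy policy), and the names P, R, lam, beta are illustrative choices, not the paper's exact formulation of the steady-state criterion with discounted deviations.

```python
import numpy as np

def optimistic_value_iteration(P, R, gamma=0.95, lam=1.0, beta=0.1, n_iters=2000):
    """Illustrative sketch of a bilevel value iteration with merged outer updates.

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    lam: variance-penalty weight; beta: learning rate for the outer update
    of the auxiliary mean parameter y, applied after every inner sweep.
    """
    S, A = R.shape
    V = np.zeros(S)   # inner value function
    y = 0.0           # outer variable: auxiliary mean parameter

    for _ in range(n_iters):
        # Inner step: one Bellman sweep on the surrogate reward
        # r - lam * (r - y)^2, which penalizes deviation from y.
        R_surr = R - lam * (R - y) ** 2          # shape (S, A)
        Q = R_surr + gamma * P @ V               # shape (S, A)
        V = Q.max(axis=1)

        # Merged ("optimistic") outer step: nudge y toward a mean-reward
        # estimate of the greedy policy with learning rate beta, instead of
        # waiting for the inner value iteration to converge first.
        greedy = Q.argmax(axis=1)
        y = (1 - beta) * y + beta * R[np.arange(S), greedy].mean()

    policy = (R - lam * (R - y) ** 2 + gamma * P @ V).argmax(axis=1)
    return V, policy, y

# Toy usage on a randomly generated 3-state, 2-action MDP (hypothetical data).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # (S, A, S), each row sums to 1
R = rng.uniform(0.0, 1.0, size=(3, 2))       # (S, A)
V, policy, y = optimistic_value_iteration(P, R)
print(policy, round(y, 3))
```

The point of merging the updates is that y moves a small step after every Bellman sweep rather than only after the inner optimization has converged, which is what makes the scheme usable in an RL setting where inner convergence cannot be checked.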
