Least squares policy iteration with instrumental variables vs. direct policy search: comparison against optimal benchmarks using energy storage

Somayeh Moazeni, Warren R Scott, Warren B Powell

doi:10.1080/03155986.2019.1624491


Open Access

https://doi.org/10.1080/03155986.2019.1624491


Abstract

This article studies least-squares approximate policy iteration (API) methods with parametrized value-function approximation. We study several variations of the policy evaluation phase, namely, Bellman error minimization, Bellman error minimization with instrumental variables, projected Bellman error minimization, and projected Bellman error minimization with instrumental variables. For a general discrete-time stochastic control problem, Bellman error minimization policy evaluation using instrumental variables is equivalent to both variants of projected Bellman error minimization. An alternative to these API methods is direct policy search based on the knowledge gradient. The practical performance of three approximate dynamic programming methods, (i) least-squares API with Bellman error minimization, (ii) least-squares API with Bellman error minimization with instrumental variables, and (iii) direct policy search, is investigated in the context of an application in energy storage operations management. We create a library of test problems using real-world data and apply value iteration to find their optimal policies. These optimal benchmarks are then used to evaluate the approximate dynamic programming policies. Our analysis indicates that least-squares API with instrumental-variables Bellman error minimization substantially outperforms least-squares API with Bellman error minimization. However, both approaches underperform our direct policy search implementation.
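As a rough illustration of the two policy evaluation variants compared above, the following sketch fits a linear value-function approximation for a fixed policy, once by ordinary least squares on the Bellman error and once using the basis functions themselves as instruments. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the three-state toy chain, the reward convention, and the choice of instruments are all hypothetical and chosen only to make the example self-contained.

```python
import numpy as np

def bellman_error_ls(phi, phi_next, rewards, gamma):
    """Bellman error minimization: ordinary least squares of the
    one-step rewards on the regressors (phi - gamma * phi_next)."""
    x = phi - gamma * phi_next
    theta, *_ = np.linalg.lstsq(x, rewards, rcond=None)
    return theta

def bellman_error_iv(phi, phi_next, rewards, gamma):
    """Bellman error minimization with instrumental variables,
    assuming the basis functions phi serve as the instruments:
    solve phi^T (phi - gamma * phi_next) theta = phi^T rewards."""
    x = phi - gamma * phi_next
    return np.linalg.solve(phi.T @ x, phi.T @ rewards)

# Hypothetical toy problem: a deterministic three-state cycle under a
# fixed policy, with one-hot (tabular) basis functions.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([1.0, 0.0, 2.0])

phi = np.eye(3)        # features of each sampled state
phi_next = P @ phi     # features of each successor state

# Exact value function of the fixed policy, for comparison.
v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)

theta_ls = bellman_error_ls(phi, phi_next, r, gamma)
theta_iv = bellman_error_iv(phi, phi_next, r, gamma)
```

In this noise-free tabular case both estimators recover the exact value function. With noisy sampled transitions the two estimates generally differ, which is the setting in which the abstract's comparison is relevant.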
