Abstract

Many stochastic dynamic programming tasks in continuous action spaces are tackled through discretization. We here avoid discretization; approximate dynamic programming (ADP) then involves (i) many learning tasks, performed here by Support Vector Machines, for Bellman-function regression, and (ii) many non-linear optimization tasks for action selection, for which we compare many algorithms. We include discretizations of the domain as particular non-linear programming tools in our experiments, so that we also compare optimization approaches and discretization methods. We conclude that robustness is strongly required in the non-linear optimizations in ADP, and experimental results show that (i) discretization is sometimes inefficient, but some specific discretizations are very efficient for "bang-bang" problems, (ii) simple evolutionary tools outperform quasi-random search in a stable manner, (iii) gradient-based techniques are much less stable, and (iv) for most high-dimensional "less unsmooth" problems Covariance-Matrix-Adaptation ranks first.

1 NON-LINEAR OPTIMIZATION IN STOCHASTIC DYNAMIC PROGRAMMING (SDP)

Some of the most traditional fields of stochastic dynamic programming, e.g. energy stock management, which have a strong economic impact, have not been studied thoroughly in the reinforcement learning or approximate dynamic programming (ADP) community. This is detrimental to reinforcement learning, as it has been pointed out that there are not yet many industrial realizations of reinforcement learning. Energy stock management leads to continuous problems that are usually handled by traditional linear approaches in which (i) convex value functions are approximated by linear cuts (leading to piecewise linear approximations, PWLA) and (ii) decisions are solutions of a linear problem. However, this approach does not work in high dimension, due to the curse of dimensionality, which strongly affects PWLA. These problems should be handled by other learning tools. In that case, however, the action selection, i.e. minimizing the expected cost-to-go, can no longer be done by linear programming, as the Bellman function is no longer a convex PWLA. The action selection is therefore a non-linear programming problem. There are few works dealing with continuous actions, and they often do not study the non-linear optimization step involved in action selection. In this paper, we focus on this part: we compare many non-linear optimization tools, and we also compare these tools to discretization techniques in order to quantify the importance of the action-selection step.

We here roughly introduce stochastic dynamic programming; the interested reader is referred to (Bertsekas and Tsitsiklis, 1996) for more details. Consider a dynamical system that stochastically evolves in time depending upon your decisions. Assume that time is discrete and has finitely many time steps, and that the total cost of your decisions is the sum of instantaneous costs. Precisely:

\[
\text{cost} = c_1 + c_2 + \dots + c_T, \qquad
c_i = c(x_i, d_i), \qquad
x_i = f(x_{i-1}, d_{i-1}, \omega_i), \qquad
d_{i-1} = \text{strategy}(x_{i-1}, \omega_i),
\]

where $x_i$ is the state at time step $i$, the $\omega_i$ are a random process, cost is to be minimized, and strategy is the decision function that has to be optimized. We are interested in a control problem: the element to be optimized is a function.
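For concreteness, the following short Python sketch simulates one trajectory of such a system and accumulates its cost under a fixed strategy. The function `simulate_trajectory` and the toy transition, cost, noise, and strategy used in the usage example are illustrative placeholders introduced here, not part of the paper.

```python
import numpy as np

def simulate_trajectory(x0, strategy, f, c, noise, T, rng):
    """Return the total cost of one simulated trajectory.

    x0       : initial state x_0
    strategy : (state, omega) -> decision,            d_{i-1} = strategy(x_{i-1}, omega_i)
    f        : (state, decision, omega) -> next state, x_i = f(x_{i-1}, d_{i-1}, omega_i)
    c        : (state, decision) -> instantaneous cost
    noise    : (rng, t) -> one realization of the random process omega
    T        : number of time steps
    """
    x, total = x0, 0.0
    for t in range(T):
        w = noise(rng, t)
        d = strategy(x, w)
        total += c(x, d)   # accumulate the instantaneous cost
        x = f(x, d, w)     # stochastic transition
    return total

# Toy 1-D stock example (purely illustrative): keep a stock near zero
# under random inflows, paying a quadratic cost for stock and decision.
rng = np.random.default_rng(0)
cost = simulate_trajectory(
    x0=1.0,
    strategy=lambda x, w: -0.5 * x,
    f=lambda x, d, w: x + d + w,
    c=lambda x, d: x**2 + 0.1 * d**2,
    noise=lambda rng, t: rng.normal(0.0, 0.1),
    T=10,
    rng=rng,
)
```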
Stochastic dynamic programming, a tool to solve this control problem, is based on Bellman's optimality principle, which can be informally stated as follows: take the decision at time step t such that the sum of the cost at time step t due to your decision and the expected cost from time step t + 1 onward is minimal. Bellman's optimality principle states that this strategy is optimal. Unfortunately, it can only be applied if the expected cost from time step t + 1 onward can be computed as a function of the current state of the system and the decision. Bellman's optimality principle thus reduces the control problem to the computation of this function. If $x_t$ can be computed from $x_{t-1}$ and $d_{t-1}$ (i.e., if $f$ is known), then the control problem is reduced to the computation of the function

\[
V(t, x_t) = \mathbb{E}\big[ c(x_t, d_t) + c(x_{t+1}, d_{t+1}) + \dots + c(x_T, d_T) \big].
\]

Note that this function depends on the strategy (for brevity we omit the dependency on the random process). We consider this expectation for an optimal strategy; even if many strategies are optimal, $V$ is uniquely determined, as it is the same for any optimal strategy. Stochastic dynamic programming is the computation of $V$ backwards in time, thanks to the following equation:

\[
V(t, x_t) = \inf_{d_t} \; c(x_t, d_t) + \mathbb{E}\, V(t+1, x_{t+1}).
\]
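To make the backward recursion concrete, here is a minimal sketch of regression-based ADP. It assumes hypothetical problem-specific helpers `sample_states`, `noise_samples`, `f`, `c`, and `d_bounds`, uses scikit-learn's SVR as a stand-in for the Support-Vector-Machine regression of the Bellman function, and uses SciPy's differential evolution as one possible non-linear optimizer for the action-selection step; the paper compares many such optimizers, and this is not its exact procedure.

```python
import numpy as np
from sklearn.svm import SVR
from scipy.optimize import differential_evolution

def adp_backward(T, sample_states, noise_samples, f, c, d_bounds, rng):
    """Backward computation of approximations of V(t, .), t = T-1, ..., 0.

    sample_states : (t, rng) -> array of shape (n_states, state_dim)
    noise_samples : (t, rng) -> iterable of noise realizations omega
    f, c          : transition and instantaneous cost, as above
    d_bounds      : list of (low, high) bounds, one per decision coordinate
    Returns a list of regressors; entry t approximates x -> V(t, x).
    """
    V_model = [None] * (T + 1)  # V_model[T] stays None: no cost beyond the horizon

    def V_hat(t, x):
        model = V_model[t]
        return 0.0 if model is None else float(model.predict(np.atleast_2d(x))[0])

    for t in reversed(range(T)):
        X = sample_states(t, rng)
        y = np.empty(len(X))
        for i, x in enumerate(X):
            ws = list(noise_samples(t, rng))
            # Q(x, d) = c(x, d) + E_omega[ V(t+1, f(x, d, omega)) ], Monte-Carlo estimate
            def Q(d, x=x, ws=ws):
                return c(x, d) + np.mean([V_hat(t + 1, f(x, d, w)) for w in ws])
            # action selection: non-linear minimization of the expected cost-to-go
            res = differential_evolution(Q, d_bounds, maxiter=30, tol=1e-3, seed=0)
            y[i] = res.fun  # V(t, x) = inf_d Q(x, d)
        V_model[t] = SVR(kernel="rbf").fit(X, y)  # Bellman-function regression
    return V_model[:T]
```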
