Abstract

This paper presents a model-free Q-learning algorithm for solving the risk-averse optimal control (RAOC) problem. The entropic risk measure is used in the RAOC problem to account for the variance of the objective function. A one-shot Q-based convex optimization problem is then formed, whose decision variables are the Q-function parameters and whose constraints are obtained by sampling an exponential utility-based entropic Bellman inequality. Samples are constructed using only a batch of data collected from a variety of control policies in a fully off-policy manner, which turns a fixed dataset into a Q-learning-based engine for risk-averse optimal policies. The solution of the exact optimization problem, which is infinite-dimensional in both decision variables and constraints, is shown to converge to the optimal risk-averse Q-function. For the standard convex optimization problem, in which function approximation of the Q-values and constraint sampling are leveraged, the performance of the approximate solutions is verified through a weighted-norm bound and a Lyapunov bound. A simulation example is provided to verify the effectiveness of the presented approach.
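As a rough illustration of the ingredients named above, and not the paper's exact formulation, the following Python sketch shows (i) the entropic risk measure rho_theta(X) = (1/theta) * log E[exp(theta * X)], whose second-order expansion E[X] + (theta/2) * Var[X] explains how it accounts for the variance of the objective, and (ii) how an off-policy transition could generate one sampled Bellman-inequality constraint on a linearly parameterized Q-function. The feature map `phi`, the risk parameter `theta`, the batch format, and the orientation of the inequality are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch under the assumptions stated above; all names are illustrative.
import numpy as np

def entropic_risk(samples, theta):
    """(1/theta) * log E[exp(theta * X)] over samples, via a stable log-sum-exp."""
    m = theta * np.asarray(samples, dtype=float)
    m_max = m.max()
    return (m_max + np.log(np.mean(np.exp(m - m_max)))) / theta

# For small theta the entropic risk is approximately mean + (theta / 2) * variance,
# which is how the measure penalizes the variance of the (cost) objective.
rng = np.random.default_rng(0)
costs = rng.normal(loc=1.0, scale=0.5, size=100_000)
theta = 0.2
print(entropic_risk(costs, theta))               # roughly 1.025
print(costs.mean() + 0.5 * theta * costs.var())  # roughly 1.025

def bellman_residual(w, phi, s, a, cost, next_states, actions, gamma, theta):
    """Residual of one sampled entropic Bellman inequality for Q(s, a) = phi(s, a) @ w.

    A transition (s, a, cost, s'_1..s'_M) from the off-policy batch yields
        phi(s, a) @ w <= cost + gamma * rho_theta( min_a' phi(s'_j, a') @ w ),
    and residual <= 0 means the inequality holds for the given parameters w.
    """
    q_next = [min(phi(sp, ap) @ w for ap in actions) for sp in next_states]
    return phi(s, a) @ w - (cost + gamma * entropic_risk(q_next, theta))
```

In the paper, such sampled inequalities serve as the constraints of the one-shot convex program over the Q-function parameters; the snippet above only evaluates a single constraint residual for a fixed parameter vector.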
