Abstract

This work describes MPQ-learning, an algorithm that approximates the set of all deterministic non-dominated policies in multi-objective Markov decision problems, where rewards are vectors and each component stands for an objective to maximize. MPQ-learning directly generalizes the ideas of Q-learning to the multi-objective case. It can be applied to non-convex Pareto frontiers and finds both supported and unsupported solutions. We present the results of applying MPQ-learning to several benchmark problems. The algorithm solves these problems successfully, showing the feasibility of the approach. We also compare MPQ-learning to a standard linearization procedure that computes only supported solutions, and show that in some cases MPQ-learning can be as effective as the scalarization method.
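Since the algorithm revolves around Pareto non-domination over vector rewards, the following minimal sketch may help fix the terminology. It is an illustrative assumption, not the authors' implementation: the names dominates and non_dominated and the NumPy representation are ours, and the final example shows an unsupported solution that no linear scalarization can recover.

    import numpy as np

    def dominates(u, v):
        # True if u Pareto-dominates v: u >= v in every component and
        # strictly greater in at least one (all objectives maximized).
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return bool(np.all(u >= v) and np.any(u > v))

    def non_dominated(vectors):
        # Keep only the vectors not Pareto-dominated by any other.
        return [u for u in vectors
                if not any(dominates(v, u) for v in vectors if v is not u)]

    # (0.8, 0.8) is non-dominated yet "unsupported": it lies below the
    # convex hull of (2, 0) and (0, 2), so for any weights w1, w2 >= 0,
    # 0.8*(w1 + w2) <= max(2*w1, 2*w2). A linear scalarization therefore
    # never selects it, while a Pareto-based filter keeps it.
    print(non_dominated([(2.0, 0.0), (0.0, 2.0), (0.8, 0.8)]))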
