Abstract

This work describes MPQ-learning, an algorithm that approximates the set of all deterministic non-dominated policies in multi-objective Markov decision problems, where rewards are vectors and each component stands for an objective to maximize. MPQ-learning directly generalizes the ideas of Q-learning to the multi-objective case. It can be applied to non-convex Pareto frontiers and finds both supported and unsupported solutions. We present the results of applying MPQ-learning to several benchmark problems. The algorithm solves these problems successfully, showing the feasibility of the approach. We also compare MPQ-learning to a standard linearization procedure that computes only supported solutions, and show that in some cases MPQ-learning can be as effective as the scalarization method.
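Since the algorithm revolves around Pareto non-domination over vector rewards, the following minimal sketch may help fix the terminology. It is an illustrative assumption, not the authors' implementation: the names dominates and non_dominated and the NumPy representation are ours, and the final example shows an unsupported solution that no linear scalarization can recover.

    import numpy as np

    def dominates(u, v):
        # True if u Pareto-dominates v: u >= v in every component and
        # strictly greater in at least one (all objectives maximized).
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return bool(np.all(u >= v) and np.any(u > v))

    def non_dominated(vectors):
        # Keep only the vectors not Pareto-dominated by any other.
        return [u for u in vectors
                if not any(dominates(v, u) for v in vectors if v is not u)]

    # (0.8, 0.8) is non-dominated yet "unsupported": it lies below the
    # convex hull of (2, 0) and (0, 2), so for any weights w1, w2 >= 0,
    # 0.8*(w1 + w2) <= max(2*w1, 2*w2). A linear scalarization therefore
    # never selects it, while a Pareto-based filter keeps it.
    print(non_dominated([(2.0, 0.0), (0.0, 2.0), (0.8, 0.8)]))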
