Multi-objective Reinforcement Learning Algorithm Research Articles

Many real-world problems involve the optimization of multiple, possibly conflicting objectives. Multi-objective reinforcement learning (MORL) is a generalization of standard reinforcement learning where the scalar reward signal is extended to multiple feedback signals, in essence, one for each objective. MORL is the process of learning policies that optimize multiple criteria simultaneously. In this paper, we present a novel temporal difference learning algorithm that integrates the Pareto dominance relation into a reinforcement learning approach. This algorithm is a multi-policy algorithm that learns a set of Pareto dominating policies in a single run. We name this algorithm Pareto Q-learning and it is applicable in episodic environments with deterministic as well as stochastic transition functions. A crucial aspect of Pareto Q-learning is the updating mechanism that bootstraps sets of Q-vectors. One of our main contributions in this paper is a mechanism that separates the expected immediate reward vector from the set of expected future discounted reward vectors. This decomposition allows us to update the sets and to exploit the learned policies consistently throughout the state space. To balance exploration and exploitation during learning, we also propose three set evaluation mechanisms. These three mechanisms evaluate the sets of vectors to accommodate for standard action selection strategies, such as e-greedy. More precisely, these mechanisms use multi-objective evaluation principles such as the hypervolume measure, the cardinality indicator and the Pareto dominance relation to select the most promising actions. We experimentally validate the algorithm on multiple environments with two and three objectives and we demonstrate that Pareto Q-learning outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies. We note that (1) Pareto Q-learning is able to learn the entire Pareto front under the usual assumption that each state-action pair is sufficiently sampled, while (2) not being biased by the shape of the Pareto front. Furthermore, (3) the set evaluation mechanisms provide indicative measures for local action selection and (4) the learned policies can be retrieved throughout the state and action space.

[1] The operation of large-scale water resources systems often involves several conflicting and noncommensurable objectives. The full characterization of tradeoffs among them is a necessary step to inform and support decisions in the absence of a unique optimal solution. In this context, the common approach is to consider many single objective problems, resulting from different combinations of the original problem objectives, each one solved using standard optimization methods based on mathematical programming. This scalarization process is computationally very demanding as it requires one optimization run for each trade-off and often results in very sparse and poorly informative representations of the Pareto frontier. More recently, bio-inspired methods have been applied to compute an approximation of the Pareto frontier in one single run. These methods allow to acceptably cover the full extent of the Pareto frontier with a reasonable computational effort. Yet, the quality of the policy obtained might be strongly dependent on the algorithm tuning and preconditioning. In this paper we propose a novel multiobjective Reinforcement Learning algorithm that combines the advantages of the above two approaches and alleviates some of their drawbacks. The proposed algorithm is an extension of fitted Q-iteration (FQI) that enables to learn the operating policies for all the linear combinations of preferences (weights) assigned to the objectives in a single training process. The key idea of multiobjective FQI (MOFQI) is to enlarge the continuous approximation of the value function, that is performed by single objective FQI over the state-decision space, also to the weight space. The approach is demonstrated on a real-world case study concerning the optimal operation of the HoaBinh reservoir on the Da river, Vietnam. MOFQI is compared with the reiterated use of FQI and a multiobjective parameterization-simulation-optimization (MOPSO) approach. Results show that MOFQI provides a continuous approximation of the Pareto front with comparable accuracy as the reiterated use of FQI. MOFQI outperforms MOPSO when no a priori knowledge on the operating policy shape is available, while produces slightly less accurate solutions when MOPSO can exploit such knowledge.

Multi-objective Reinforcement Learning Algorithm Research Articles

Related Topics

Articles published on Multi-objective Reinforcement Learning Algorithm

Multi-objective reinforcement learning using sets of pareto dominating policies

A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run

"Notice of Violation of IEEE Publication Principles" Multiobjective Reinforcement Learning: A Comprehensive Overview.

Empirical evaluation methods for multiobjective reinforcement learning algorithms

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Multi-objective Reinforcement Learning Algorithm Research Articles

Related Topics

Articles published on Multi-objective Reinforcement Learning Algorithm

Multi-objective reinforcement learning using sets of pareto dominating policies

A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run

"Notice of Violation of IEEE Publication Principles" Multiobjective Reinforcement Learning: A Comprehensive Overview.

Empirical evaluation methods for multiobjective reinforcement learning algorithms