Abstract

Purpose
Many practical control problems require achieving multiple objectives, and these objectives often conflict with one another. Existing multi-objective evolutionary reinforcement learning algorithms achieve poor search results on such problems, so a new multi-objective evolutionary reinforcement learning algorithm with stronger search capability is needed.

Design/methodology/approach
The multi-objective reinforcement learning algorithm proposed in this paper is built on an evolutionary computation framework. In each generation, a long-short-term selection method selects the parent policies. Long-term selection is based on each policy's improvement along its predefined optimization direction in the previous generation, while short-term selection uses a prediction model to identify the optimization direction likely to yield the greatest improvement in overall population performance. In the evolutionary stage, a penalty-based nonlinear scalarization method scalarizes the multi-dimensional advantage functions, and a nonlinear multi-objective policy gradient is designed to optimize the parent policies along the predefined directions.

Findings
The penalty-based nonlinear scalarization method forces policies to improve along the predefined optimization directions. The long-short-term selection method alleviates the exploration-exploitation problem, enabling the algorithm to explore unknown regions while ensuring that promising policies are fully optimized. Combining these designs effectively improves the performance of the final population.

Originality/value
A multi-objective evolutionary reinforcement learning algorithm with stronger search capability is proposed. The algorithm can find a Pareto policy set with better convergence, diversity and density.
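As a rough illustration of the scalarization idea described above, the Python sketch below shows one plausible penalty-based nonlinear form, modeled on the penalty boundary intersection (PBI) scalarization commonly used in decomposition-based multi-objective optimization. The function name penalty_scalarize, the penalty weight theta and the PBI-style formula are assumptions for illustration only; the abstract does not specify the paper's exact formulation.

import numpy as np

def penalty_scalarize(advantages, direction, theta=5.0):
    """Hypothetical penalty-based scalarization of a multi-dimensional
    advantage vector along a predefined optimization direction.

    advantages : np.ndarray, shape (m,) -- per-objective advantage estimates
    direction  : np.ndarray, shape (m,) -- predefined optimization direction
    theta      : float -- penalty weight discouraging deviation (assumed)
    """
    direction = direction / np.linalg.norm(direction)
    # d1: progress of the advantage vector along the predefined direction.
    d1 = float(np.dot(advantages, direction))
    # d2: perpendicular deviation from that direction; penalizing it pushes
    # the policy gradient to improve along the chosen direction.
    d2 = float(np.linalg.norm(advantages - d1 * direction))
    return d1 - theta * d2

# Example: a two-objective advantage vector scalarized along an
# equal-weight direction.
adv = np.array([0.8, 0.3])
scalar_adv = penalty_scalarize(adv, np.array([1.0, 1.0]))

Because the penalty term grows with the deviation from the predefined direction, maximizing this scalar quantity keeps a policy's improvement aligned with its assigned direction, which matches the behavior the abstract attributes to the penalty-based scalarization.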


