Abstract

Reinforcement Learning (RL) in non-stationary environments presents a formidable challenge. In some applications, abrupt changes in the environment model can be anticipated ahead of time, yet the existing literature lacks a framework that proactively exploits such predictions to improve reward optimization. This paper introduces a methodology that leverages these predictions preemptively to maximize the overall achieved performance. The approach constructs a weighted mixture policy from the optimal policies of the prevailing and forthcoming environment models. To ensure safe learning, an adaptive learning rate is derived for training the weighted mixture policy, which theoretically guarantees monotonic performance improvement at each update. Empirical evaluation focuses on a model-free predictive reference-tracking scenario with piecewise constant references. Using the cart–pole position control problem, it is demonstrated that the proposed algorithm outperforms prior techniques such as context Q-learning and RL with context detection in non-stationary environments. Moreover, it outperforms applying the individual optimal policies derived from each observed environment model (i.e., policies that do not use predictions).
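
As a minimal illustration of the core idea, the sketch below blends the optimal policies of the current and predicted-next environment models into a single weighted mixture policy with a time-varying weight. This is an assumption-laden sketch, not the authors' implementation: the tabular policy representation, the linear weighting schedule, and the names `mixture_policy`, `pi_current`, `pi_next`, and `w` are introduced here for illustration, and the paper's adaptive learning rate is left abstract.

```python
# Hypothetical sketch of a weighted mixture policy between the optimal policies
# of the current and predicted-next environment models. The weighting schedule
# and all names here are illustrative assumptions, not the paper's algorithm.
import numpy as np

def mixture_policy(pi_current, pi_next, w):
    """Return the weighted mixture of two tabular policies.

    pi_current, pi_next: arrays of shape (n_states, n_actions) with rows
        summing to 1 (optimal policies for the prevailing and forthcoming
        environment models).
    w: mixture weight in [0, 1]; w=0 keeps the current policy, w=1 switches
        fully to the policy of the predicted next model.
    """
    pi_mix = (1.0 - w) * pi_current + w * pi_next
    return pi_mix / pi_mix.sum(axis=1, keepdims=True)  # renormalize rows

# Example: 3 states, 2 actions, with the model change predicted 10 steps ahead.
rng = np.random.default_rng(0)
pi_cur = rng.dirichlet(np.ones(2), size=3)   # stand-in for the current optimal policy
pi_nxt = rng.dirichlet(np.ones(2), size=3)   # stand-in for the predicted-next optimal policy

steps_to_change = 10
for t in range(steps_to_change):
    w = t / steps_to_change                  # simple linear ramp toward the next policy
    pi = mixture_policy(pi_cur, pi_nxt, w)
    # ... act with pi, collect transitions, and update the mixture policy using a
    # learning rate small enough to keep each update a monotonic improvement
    # (the paper derives an adaptive rate for this; it is left abstract here).
```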
