Abstract
Reinforcement learning (RL) agents can learn to control a nonlinear system without using a model of the system. However, having a model brings benefits, mainly in terms of a reduced number of unsuccessful trials before acceptable control performance is achieved. Several modelling approaches have been used in the RL domain, such as neural networks, local linear regression, or Gaussian processes. In this article, we focus on techniques that have not been used much so far: symbolic regression (SR), based on genetic programming, and local modelling. Using measured data, symbolic regression yields a nonlinear, continuous-time analytic model. We benchmark two state-of-the-art methods, SNGP (single-node genetic programming) and MGGP (multigene genetic programming), against a standard incremental local regression method called RFWR (receptive field weighted regression). We have introduced modifications to the RFWR algorithm to better suit the low-dimensional continuous-time systems we are mostly dealing with. The benchmark is a nonlinear, dynamic magnetic manipulation system. The results show that, using the RL framework and a suitable approximation method, it is possible to design a stable controller for such a complex system without the need for haphazard trial-and-error learning. While all of the approximation methods were successful, MGGP achieved the best results at the cost of higher computational complexity.
Index Terms: AI-based methods, local linear regression, nonlinear systems, magnetic manipulation, model learning for control, optimal control, reinforcement learning, symbolic regression.
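As a rough illustration of the modelling step only, the sketch below fits an analytic expression for a state derivative from state-input samples using the open-source gplearn package and a toy data set; it is not the SNGP or MGGP implementation benchmarked in the article, and the variable names and settings are assumptions made for the example.

# Sketch: genetic-programming-based symbolic regression of a toy
# state derivative x1_dot = f(x1, x2, u) from sampled data (gplearn).
import numpy as np
from gplearn.genetic import SymbolicRegressor

# Toy stand-in for measured data: states x1, x2 and input u sampled at
# random, with the "true" derivative given by a simple nonlinear law.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 3))             # columns: x1, x2, u
y = -2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.5 * X[:, 2]  # x1_dot (toy example)

sr = SymbolicRegressor(
    population_size=1000,
    generations=20,
    function_set=("add", "sub", "mul", "div", "sin"),
    parsimony_coefficient=0.001,   # penalize overly long expressions
    random_state=0,
)
sr.fit(X, y)
print(sr._program)    # the evolved analytic expression for x1_dot
y_pred = sr.predict(X)

In the article, one such analytic expression would be learned per state derivative, yielding a continuous-time model of the whole system.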
Highlights
A reinforcement learning (RL) agent interacts with the system to be controlled by measuring its states and applying actions according to a policy so that a given goal state is attained. The policy is iteratively adapted in such a way that the agent receives the highest possible cumulative reward, a scalar value accumulated over trajectories in the system’s state space. The reward associated with each transition in the state space is given by a predefined reward function. Existing RL algorithms can be divided into critic-only, actor-only, and actor-critic variants. The critic-only variants optimize the value function (V-function), which is then used to derive the policy; the actor-only variants optimize the policy directly, without any need for a value function; and the actor-critic variants optimize both functions simultaneously
We compared the performance of various models of a complex nonlinear system created with three different approximation methods: two based on genetic programming and a third based on a modified local linear regression algorithm (RFWR)
Most of the models selected for the actual control experiments were successful in achieving stable control, although they differ in precision. The histogram in Figure 10 shows the number of models for the two GP-based algorithms (SNGP and multigene genetic programming (MGGP)) that fall into several MSE categories. The MSE describes the control precision as the mean-squared error between the goal trajectory and the actual trajectory of the closed-loop controlled system
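A minimal sketch of how such a control-precision score can be computed is given below; the exact definition used in the article (for example, how errors are averaged over state dimensions or repeated trials) may differ.

import numpy as np

def control_mse(goal_traj, actual_traj):
    # Mean-squared tracking error between the goal trajectory and the
    # measured closed-loop trajectory; both arrays have shape (T, n_states).
    goal_traj = np.asarray(goal_traj, dtype=float)
    actual_traj = np.asarray(actual_traj, dtype=float)
    return float(np.mean((goal_traj - actual_traj) ** 2))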
Summary
A reinforcement learning (RL) agent interacts with the system to be controlled by measuring its states and applying actions according to a policy so that a given goal state is attained. The policy is iteratively adapted in such a way that the agent receives the highest possible cumulative reward, a scalar value accumulated over trajectories in the system’s state space. The reward associated with each transition in the state space is given by a predefined reward function. Existing RL algorithms can be divided into critic-only, actor-only, and actor-critic variants. The critic-only variants optimize the value function (V-function), which is then used to derive the policy; the actor-only variants optimize the policy directly, without any need for a value function; and the actor-critic variants optimize both functions simultaneously. From a different point of view, RL algorithms can be divided into model-based and model-free variants. Examples of both approaches can be found in [3, 4]. Model-based methods exploit a model of the system, whereas model-free methods learn online exclusively through trial and error. Both variants have their specific advantages and disadvantages. We employ the model-based, critic-only variant without any online training, so that we can compare different modelling approaches
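A minimal sketch of this offline, model-based, critic-only scheme is given below, assuming a learned transition model f_hat, a reward function, a discretized state grid, and a finite action set; all of these, including the nearest-neighbour value lookup, are illustrative placeholders rather than the article's actual implementation. The value function is computed by value iteration using only the model, and the policy then greedily maximizes the immediate reward plus the discounted value of the predicted next state.

# Sketch: offline, model-based, critic-only RL.
import numpy as np

GAMMA = 0.99  # discount factor (assumed)

def value_iteration(states, actions, f_hat, reward, nearest_idx, sweeps=200):
    # Compute V on a finite set of grid states using only the learned model;
    # nearest_idx(x) returns the index of the grid point closest to x
    # (a crude stand-in for the interpolation a practical scheme would use).
    V = np.zeros(len(states))
    for _ in range(sweeps):
        for i, x in enumerate(states):
            V[i] = max(reward(x, u) + GAMMA * V[nearest_idx(f_hat(x, u))]
                       for u in actions)
    return V

def greedy_policy(x, actions, f_hat, reward, V, nearest_idx):
    # Critic-only policy: pick the action that maximizes the one-step return
    # predicted by the model plus the discounted value of the next state.
    scores = [reward(x, u) + GAMMA * V[nearest_idx(f_hat(x, u))]
              for u in actions]
    return actions[int(np.argmax(scores))]

Here f_hat would stand in for the analytic model obtained by symbolic regression (sampled in time) or for the local linear RFWR model; once V has been computed offline, no further interaction with the plant is needed to derive the controller.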