A key open question in adaptive estimator design is how to ensure that the parameters of the proposed algorithms converge to nearly correct solutions, so that the learning algorithm is unbiased. Moreover, determining the speed of parameter convergence is important because it provides insight into the performance of the learning algorithms. The main contributions of the article are fourfold. First, the article introduces an adaptive estimator that learns the discounted Q-function and an approximate optimal control policy without requiring the linear, discrete-time, unstable output-error system dynamics, using only noisy system measurements. The simulation results show that the adaptive estimator minimizes the stochastic cost function and the temporal difference error, and learns the approximate Q-function together with the control policy. Second, a different approach is considered: a simple test problem is used to investigate issues associated with the Q-function’s representation and parametric convergence. In particular, the terminal convergence problem is analyzed with a known optimal control policy, where the aim is to accurately learn only the Q-function. The Q-function is parameterized by terms that are functions of the unknown plant’s parameters and the Q-function’s discount factor, and their convergence properties are analyzed and compared with those of the adaptive estimator. Third, it is shown that although the adaptive estimator with a large Q-function discount factor yields larger control feedback gains, and hence faster convergence of the state to the upright position, the learning problem is badly conditioned, so parameter convergence is sluggish as the Q-function discount factor approaches the inverse of the dominant pole of the unstable system. Finally, the state output learned by the adaptive estimator is compared with the outputs obtained from traditional system identification algorithms.
Simulation results for a higher-order unstable output-error system show that the adaptive estimator closely follows the real system output, whereas the system identification algorithms do not.