Policy iteration (PI), an iterative reinforcement learning method, has the merit of learning a decision policy through alternating policy evaluation and policy improvement while interacting with a poorly known environment. However, existing PI-based results for output-feedback (OPFB) continuous-time systems rely heavily on an initial stabilizing full state-feedback (FSFB) policy, which raises the question of violating the OPFB principle. This article addresses this question and establishes PI under an initial stabilizing OPFB policy. We prove that an off-policy Bellman equation can transform any OPFB policy into an FSFB policy. Based on this transformation property, we revise the traditional PI by appending an additional iteration, which proves efficient in approximating the optimal control from the initial OPFB policy. We demonstrate the effectiveness of the proposed learning methods through theoretical analysis and a case study.
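For context, the sketch below illustrates only the classical model-based PI loop for continuous-time linear-quadratic control with a full state-feedback gain (Kleinman-style iteration): policy evaluation solves a Lyapunov equation and policy improvement computes a greedy gain. It is not the article's data-driven OPFB scheme; the function name, system matrices, and initial gain are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are


def policy_iteration_lqr(A, B, Q, R, K0, iters=20):
    """Model-based PI for continuous-time LQR (illustrative sketch).

    K0 is assumed stabilizing, i.e. A - B @ K0 is Hurwitz.
    """
    K = K0
    for _ in range(iters):
        Ak = A - B @ K
        # Policy evaluation: solve Ak^T P + P Ak + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
        # Policy improvement: greedy gain K = R^{-1} B^T P
        K = np.linalg.solve(R, B.T @ P)
    return P, K


# Hypothetical second-order example; compare against the Riccati solution.
A = np.array([[0.0, 1.0], [-1.0, 2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K0 = np.array([[0.0, 4.0]])  # stabilizing initial full state-feedback gain (assumed)
P_pi, K_pi = policy_iteration_lqr(A, B, Q, R, K0)
P_are = solve_continuous_are(A, B, Q, R)
print(np.allclose(P_pi, P_are, atol=1e-8))  # PI converges to the LQR solution
```

The article's contribution, by contrast, is to remove the need for the initial FSFB gain `K0` above by starting from a stabilizing OPFB policy and converting it via an off-policy Bellman equation.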