Abstract

The applicability of model-based online reinforcement learning algorithms is often limited by the amount of exploration required to learn the environment model to the desired level of accuracy. A promising approach to addressing this issue is to exploit side information, available either a priori or during the agent's mission, for learning the unknown dynamics. Side information, in our context, refers to bounds on the differences between transition probabilities at different states in the environment. We use this information as a measure of the reusability of the direct experience gained by performing actions and observing the outcomes at different states. We propose a framework that integrates side information into existing model-based reinforcement learning algorithms by complementing the samples obtained directly at a state with second-hand information obtained from other states with similar dynamics. Additionally, we propose an algorithm for synthesizing the optimal control strategy in unknown environments by using side information to balance exploration and exploitation effectively. We prove that, with high probability, the proposed algorithm yields a near-optimal policy in the Bayesian sense, while also guaranteeing the safety of the agent during exploration. The near-optimal policy is obtained within a number of time steps polynomial in the parameters describing the model. We illustrate the utility of the proposed algorithms in a Mars rover setting, where data from onboard sensors and a companion aerial vehicle serve as the side information.
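The core idea of reusing experience across states with provably similar dynamics can be illustrated with a small sketch. The code below is not the paper's algorithm; it is a hypothetical illustration in which side information is represented as assumed upper bounds on the L1 distance between transition distributions at pairs of states, and the class `SideInfoModel`, its `similarity_bound` parameter, and the `record`/`estimate` methods are all names introduced here for exposition.

```python
from collections import defaultdict

class SideInfoModel:
    """Sketch: pool 'second-hand' transition samples from states whose
    dynamics are, per the side information, close to the state of interest."""

    def __init__(self, similarity_bound):
        # similarity_bound[(s, s_other, a)]: assumed upper bound on the L1
        # distance between P(.|s,a) and P(.|s_other,a) -- the side information.
        self.similarity_bound = similarity_bound
        # (s, a) -> {s_next: number of times s_next was observed}
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, s, a, s_next):
        """Store direct experience: one transition observed at state s."""
        self.counts[(s, a)][s_next] += 1

    def estimate(self, s, a, max_bound=0.2):
        """Estimate P(.|s,a) from direct samples plus samples reused from
        states within max_bound of s. Returns the pooled estimate and the
        largest similarity bound used, which should widen any confidence
        interval built on top of the pooled sample count."""
        pooled, slack = defaultdict(int), 0.0
        for (s_src, a_src), outcomes in self.counts.items():
            if a_src != a:
                continue
            bound = 0.0 if s_src == s else self.similarity_bound.get((s, s_src, a), float("inf"))
            if bound > max_bound:
                continue  # dynamics not known to be similar enough; skip
            slack = max(slack, bound)
            for s_next, n in outcomes.items():
                pooled[s_next] += n
        total = sum(pooled.values())
        if total == 0:
            return None, None
        return {s_next: n / total for s_next, n in pooled.items()}, slack
```

The design point this sketch makes is that borrowed samples are never free: the pooled estimate is only accurate up to the similarity bound used, so that slack must be carried alongside the sample count when deciding whether a state is "known" well enough to stop exploring it.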
