Abstract

Most of the research in Reinforcement Learning (RL) focuses on balancing exploration and exploitation. Indeed, the success or failure of an RL algorithm often comes down to the choice between executing exploratory actions and exploiting actions that are known to be good. In the context of Multi-Armed Bandits (MABs), many algorithms have addressed this dilemma. In particular, Thompson Sampling (TS) is a solution that, besides having good theoretical properties, usually works very well in practice. Unfortunately, the success of TS in MAB problems has not been replicated in RL, where it has been shown to scale very poorly w.r.t. the dimensionality of the problem. Nevertheless, the application of TS in RL, instead of more myopic strategies such as ε-greedy, remains a promising solution. This paper addresses this issue by proposing several algorithms that make the use of TS in RL and deep RL feasible. We present these algorithms, explaining the intuitions and theoretical considerations behind them and discussing their advantages and drawbacks. Furthermore, we provide an empirical evaluation on an increasingly complex set of RL problems, showing the benefit of TS w.r.t. other sampling strategies available in classical and more recent RL literature.

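For readers unfamiliar with TS in the bandit setting the abstract refers to, the following is a minimal illustrative sketch (ours, not taken from the paper) of Thompson Sampling on a Bernoulli MAB with Beta posteriors; the number of arms, the horizon, and the true arm means are assumptions chosen only for demonstration.

    import numpy as np

    # Illustrative sketch: Thompson Sampling for a Bernoulli multi-armed bandit.
    # Each arm keeps a Beta(alpha, beta) posterior over its success probability;
    # at every step we draw one sample per arm, pull the arm with the largest
    # sample, and update that arm's posterior with the observed reward.

    rng = np.random.default_rng(0)
    true_means = [0.3, 0.5, 0.7]        # unknown to the agent (assumed values)
    n_arms, horizon = len(true_means), 1000
    alpha = np.ones(n_arms)             # Beta prior: 1 + observed successes
    beta = np.ones(n_arms)              # Beta prior: 1 + observed failures

    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # one posterior sample per arm
        arm = int(np.argmax(theta))     # act greedily w.r.t. the sampled means
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward            # posterior update for the pulled arm
        beta[arm] += 1 - reward
        total_reward += reward

    print(f"average reward over the horizon: {total_reward / horizon:.3f}")

The sampling step replaces the explicit exploration parameter of ε-greedy: arms with uncertain posteriors are still pulled occasionally because their samples sometimes exceed those of the empirically best arm, which is the behaviour the paper seeks to carry over from MABs to RL.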