Abstract
Much of the research in Reinforcement Learning (RL) focuses on balancing exploration and exploitation. Indeed, the success or failure of an RL algorithm often hinges on the choice between executing exploratory actions and exploiting actions that are known to be good. In the context of Multi-Armed Bandits (MABs), many algorithms have addressed this dilemma. In particular, Thompson Sampling (TS) is a solution that, besides having good theoretical properties, usually works very well in practice. Unfortunately, the success of TS in MAB problems has not been replicated in RL, where it has been shown to scale very poorly w.r.t. the dimensionality of the problem. Nevertheless, applying TS in RL, instead of more myopic strategies such as ε-greedy, remains a promising direction. This paper addresses this issue by proposing several algorithms for using TS in RL and deep RL in a feasible way. We present these algorithms, explaining the intuitions and theoretical considerations behind them and discussing their advantages and drawbacks. Furthermore, we provide an empirical evaluation on an increasingly complex set of RL problems, showing the benefit of TS w.r.t. other sampling strategies available in classical and more recent RL literature.
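As background for the contrast drawn above, the following is a minimal sketch (not taken from the paper) of Thompson Sampling on a Bernoulli multi-armed bandit, compared against an ε-greedy baseline. The arm success probabilities, horizon, and ε value are illustrative assumptions.

```python
# Minimal sketch: Thompson Sampling vs. epsilon-greedy on a Bernoulli MAB.
# Arm means, horizon, and epsilon are assumed values for illustration only.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])   # assumed Bernoulli success probabilities
n_arms, horizon = len(true_means), 2000

# Thompson Sampling: Beta(1, 1) prior per arm; sample a mean, act greedily on it.
alpha, beta = np.ones(n_arms), np.ones(n_arms)
ts_reward = 0.0
for _ in range(horizon):
    arm = int(np.argmax(rng.beta(alpha, beta)))     # posterior sampling step
    r = float(rng.random() < true_means[arm])       # Bernoulli reward
    alpha[arm] += r
    beta[arm] += 1.0 - r
    ts_reward += r

# epsilon-greedy baseline: explore uniformly at random with probability epsilon.
epsilon = 0.1
counts, values = np.zeros(n_arms), np.zeros(n_arms)
eg_reward = 0.0
for _ in range(horizon):
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(values))
    r = float(rng.random() < true_means[arm])
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]  # incremental mean estimate
    eg_reward += r

print(f"Thompson Sampling reward: {ts_reward:.0f}, epsilon-greedy reward: {eg_reward:.0f}")
```

In this toy setting, posterior sampling concentrates exploration on arms whose value is still uncertain, whereas ε-greedy keeps exploring uniformly regardless of what has been learned; this is the kind of directed exploration the paper seeks to carry over to RL and deep RL.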