Abstract

In traditional Reinforcement Learning (RL), agents learn to optimize actions in a dynamic context through recursive estimation of expected values. We show that this form of machine learning fails when rewards (returns) are affected by tail risk, i.e., leptokurtosis. Here, we adapt a recent extension of RL, called distributional RL (disRL), and introduce estimation efficiency, while properly adjusting for the differential impact of outliers on the two terms of the RL prediction error in the updating equations. We show that the resulting “efficient distributional RL” (e-disRL) learns much faster and is robust once it settles on a policy. Our paper also provides a brief, nontechnical overview of machine learning, with a focus on RL.

Highlights

  • Reinforcement Learning (RL) has been successfully applied in diverse domains

  • We show the superiority of efficient distributional RL (e-disRL) over Temporal Difference (TD) Learning and standard disRL

  • To disentangle the effect of separating the two terms of the prediction error from the effect of efficient estimation of the mean, we proceed in stages. We first report results for an estimator that only implements the separation but still uses the sample average as the estimator of expected rewards, and then for an estimator that both separates the components of the TD error and applies efficient estimation when computing the mean of the empirical distribution of rewards (see the sketch after this list)
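The excerpt does not specify which efficient estimator the paper uses. As a rough illustration of the two stages described in the last highlight, the sketch below contrasts the plain sample average with a trimmed mean, an assumed, outlier-robust stand-in, on fat-tailed (leptokurtic) rewards. The function names and the trimming fraction are hypothetical, not the paper's implementation.

```python
import numpy as np
from scipy import stats

def sample_average(rewards):
    """Stage 1: separation only; the mean of the empirical reward distribution
    is still estimated by the plain sample average."""
    return np.mean(rewards)

def efficient_mean(rewards, trim=0.1):
    """Stage 2: an efficient (outlier-robust) estimate of the same mean.
    A trimmed mean is used here purely as an illustrative choice."""
    return stats.trim_mean(rewards, proportiontocut=trim)

# Leptokurtic rewards: Student-t draws with 3 degrees of freedom (fat tails).
rng = np.random.default_rng(0)
rewards = rng.standard_t(df=3, size=1000)
print(sample_average(rewards), efficient_mean(rewards))
```

On such fat-tailed samples the sample average is dragged around by outliers, while the robust estimate of the same mean is far more stable across draws, which is the intuition behind introducing estimation efficiency.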


Summary

Introduction

Reinforcement Learning (RL) has been successfully applied in diverse domains, but the domain of finance remains a challenge. We focus on one version of TD Learning, called SARSA, whereby the agent takes the action in the subsequent trial to be the one deemed optimal given the new state, i.e., the action that provides the maximum estimated Q value given that state. New estimates of the Q values of state-action pairs can instead be obtained by taking the expectation over the empirical distribution of observed rewards. This technique, referred to as Distributional RL (disRL), has been more successful than traditional, recursive TD Learning in contexts such as games, where the state space is large and the relationship between states and action values is complex.
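For concreteness, the sketch below illustrates the two update rules the excerpt refers to in a tabular setting: a SARSA-style TD update in which the action for the next trial is the one with the maximum estimated Q value, and a distributional variant that stores an empirical distribution of targets per state-action pair and takes its expectation as the new Q estimate. Function names, the learning rate, and the discount factor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def sarsa_greedy_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Recursive TD update: the next action is the one with the maximum estimated Q value."""
    a_next = int(np.argmax(Q[s_next]))
    # Prediction error: target (reward plus discounted next Q) minus current estimate.
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return a_next

def disrl_update(returns, Q, s, a, r, s_next, gamma=0.99):
    """Distributional variant: keep the empirical distribution of targets per
    state-action pair and take its expectation as the new Q value."""
    a_next = int(np.argmax(Q[s_next]))
    returns[(s, a)].append(r + gamma * Q[s_next, a_next])  # grow the empirical distribution
    Q[s, a] = np.mean(returns[(s, a)])                      # expectation over that distribution
    return a_next

# Toy usage: 3 states, 2 actions, one synthetic transition.
Q = np.zeros((3, 2))
returns = defaultdict(list)
sarsa_greedy_update(Q, s=0, a=1, r=0.5, s_next=2)
disrl_update(returns, Q, s=0, a=1, r=0.5, s_next=2)
```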

Machine Learning
Reinforcement Learning
Our Contribution
TD Learning
Leptokurtosis
Proposed Solution
Environment
Reward
Convergence
Methods
The Gaussian Environment
The Leptokurtic Environment I: t-Distribution
Procedure
The Leptokurtic Environment II
Impact of Outlier Risk on Categorical Distributional RL
Findings
Conclusions
