Abstract
In traditional Reinforcement Learning (RL), agents learn to optimize actions in a dynamic context based on recursive estimation of expected values. We show that this form of machine learning fails when rewards (returns) are affected by tail risk, i.e., leptokurtosis. Here, we adapt a recent extension of RL, called distributional RL (disRL), and introduce estimation efficiency, while properly adjusting for differential impact of outliers on the two terms of the RL prediction error in the updating equations. We show that the resulting “efficient distributional RL” (e-disRL) learns much faster, and is robust once it settles on a policy. Our paper also provides a brief, nontechnical overview of machine learning, focusing on RL.
Highlights
Reinforcement Learning (RL) has been successfully applied in diverse domains
We prove the superiority of efficient distributional RL (e-disRL) over Temporal Difference (TD) Learning and disRL
To disentangle the effect of separating the two terms of the prediction error from the effect of efficient estimation of the mean, we proceed in stages: we first report results for an estimator that implements only the separation, while continuing to use the sample average as the estimator of expected rewards, and then for an estimator that both separates the components of the TD error and applies efficient estimation when calculating the mean of the empirical distribution of rewards
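The staged comparison above can be illustrated with a minimal sketch. The contrast is between the plain sample average and an outlier-robust estimator of the mean of rewards; the trimmed mean used here is only a stand-in for illustration, as the paper's specific efficient estimator is not detailed in this excerpt.

```python
import numpy as np

def sample_average(rewards):
    """Stage 1 baseline: ordinary sample mean of observed rewards."""
    return float(np.mean(rewards))

def trimmed_mean(rewards, trim=0.2):
    """Stage 2 stand-in: a robust mean that discards a fraction of
    extreme observations on each tail. (Illustrative only; the paper's
    actual efficient estimator may differ.)"""
    r = np.sort(np.asarray(rewards, dtype=float))
    k = int(len(r) * trim)
    return float(r[k:len(r) - k].mean()) if len(r) > 2 * k else float(r.mean())

# A leptokurtic-looking sample: one tail outlier dominates the plain mean
rewards = [0.0, 1.0, 2.0, 3.0, 100.0]
print(sample_average(rewards))   # pulled up by the outlier
print(trimmed_mean(rewards))     # closer to the bulk of the distribution
```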
Summary
Reinforcement Learning (RL) has been successfully applied in diverse domains, but the domain of finance remains a challenge. We focus on one version of TD Learning, called SARSA, whereby the agent takes the action in the subsequent trial to be the one deemed optimal given the new state, i.e., the action that provides the maximum estimated Q value given that state. New estimates of the Q values of state-action pairs are obtained by taking the expectation over this empirical distribution. This technique, referred to as Distributional RL (disRL), has been more successful than traditional, recursive TD Learning in contexts such as games, where the state space is large and the relation between states and action values is complex.
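The update described in the summary can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tabular Q array, the step size alpha, and the discount gamma are assumptions, and the next action is chosen greedily as the summary describes.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One update of the state-action value estimate.

    The next action is the one with the maximum estimated Q value in the
    new state (per the summary); the TD error is the reward-plus-bootstrap
    term minus the current estimate, and Q is nudged toward it.
    """
    a_next = int(np.argmax(Q[s_next]))                 # greedy next action
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]  # prediction error
    Q[s, a] += alpha * td_error                         # recursive update
    return a_next

# Usage: a 2-state, 2-action table, one observed transition
Q = np.zeros((2, 2))
sarsa_update(Q, s=0, a=0, r=1.0, s_next=1)
print(Q[0, 0])  # estimate moves toward the observed reward
```

It is exactly the two terms of this TD error, the observed reward plus bootstrap value and the current estimate, that the paper proposes to treat separately when rewards are leptokurtic.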