Abstract

Reinforcement Learning (RL) enables an agent to learn control policies for achieving its long-term goals. One key parameter of RL algorithms is the discount factor, which scales down future cost in the current value estimate of a state. This study introduces and analyses a transition-based discount factor in two model-free reinforcement learning algorithms, Q-learning and SARSA, and shows their convergence using the theory of stochastic approximation for finite state and action spaces. The resulting asymmetric discounting, which favours some transitions over others, (1) yields faster convergence than the constant-discount-factor variants of these algorithms, as demonstrated by experiments on the Taxi domain and the MountainCar environment, and (2) provides better control over whether an RL agent learns a risk-averse or a risk-taking policy, as demonstrated in a Cliff Walking experiment.
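
To make the idea concrete, below is a minimal sketch of tabular Q-learning in which the constant discount factor is replaced by a function of the transition (s, a, s'). It assumes a Gymnasium-style environment with discrete state and action spaces (such as Taxi); the function name q_learning_transition_discount, the argument gamma_fn, and the hyperparameter values are illustrative and not taken from the paper.

```python
import numpy as np

def q_learning_transition_discount(env, gamma_fn, alpha=0.1, epsilon=0.1,
                                    episodes=500):
    """Tabular Q-learning where the discount is a function of the
    transition (s, a, s') rather than a constant.

    `env` is assumed to expose a Gymnasium-style reset()/step() interface
    with discrete observation and action spaces; `gamma_fn(s, a, s_next)`
    returns the discount applied to that particular transition.
    """
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # Transition-dependent discount replaces the constant gamma;
            # in an episodic task gamma_fn would typically return 0 for
            # transitions into a terminal state.
            g = gamma_fn(s, a, s_next)
            td_target = r + g * np.max(Q[s_next])
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```

The only change from standard Q-learning is that each update evaluates gamma_fn on the observed transition, which is what produces the asymmetric discounting described above.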

Highlights

  • Reinforcement Learning (RL) algorithms train a software agent to determine a policy that solves an RL problem as efficiently as possible

  • We provide a way to set the discount factor that can be used in tasks with finite state and action spaces where a positive reward is given at the goal state

  • After defining Q-learning with the adaptive discount factor, we present the convergence of this algorithm

Summary

Introduction

RL algorithms train a software agent to determine a policy that solves an RL problem as efficiently as possible. Inspired by Wei [22], Yoshida gave convergence results for the Q-learning algorithm with a state-dependent discount factor [30]. These studies directly investigate the discount factor's role in various cases, using both a test-problem approach and a general function approximation setting. A transition-based discount has a benefit over a constant one: it provides a general unified theory for both episodic and non-episodic tasks. This setting of the discount factor has not yet been studied for the Q-learning and SARSA algorithms. The novelty of our work is the introduction of a discount factor that is a function of the transition variables (current state, action, and next state) into the two model-free reinforcement learning algorithms mentioned above, together with convergence results for the same.
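
As an illustration of how a transition-based discount can unify episodic and non-episodic tasks, the sketch below builds a gamma(s, a, s') that returns zero on transitions into goal states and can discount a chosen set of transitions less than the rest. The helper name make_transition_discount, the argument names, and the numeric values are illustrative assumptions, not the paper's notation.

```python
def make_transition_discount(goal_states, gamma_default=0.99,
                             favoured=None, gamma_favoured=0.999):
    """Illustrative construction of a transition-based discount gamma(s, a, s')."""
    favoured = favoured or set()

    def gamma_fn(s, a, s_next):
        if s_next in goal_states:
            # Zero discount on transitions into a goal (terminal) state lets
            # episodic tasks be expressed in the same framework as
            # continuing ones.
            return 0.0
        if (s, a, s_next) in favoured:
            # Asymmetric discounting: favoured transitions are discounted
            # less, so value propagates more strongly along them.
            return gamma_favoured
        return gamma_default

    return gamma_fn
```

A function built this way could be passed as gamma_fn to a Q-learning or SARSA update, such as the Q-learning sketch given after the abstract.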

Background
Q-Learning
Transition Dependent Discount Factor
Q-Learning with Transition-Based Discounts
Experiments
Mountain Car
Discussion and Conclusions