Abstract

Traditional reinforcement learning (RL) trains an agent to learn an optimal policy using the expected return, i.e., the expected value of the cumulative random rewards. However, recent research indicates that learning the distribution over returns has distinct advantages over learning only their expected value across a range of RL tasks. The shift from the expectation of returns in traditional RL to the distribution over returns in distributional RL has provided new insights into the dynamics of RL. This paper builds on our recent work investigating the quantum approach to RL. We implement quantile regression (QR) distributional Q learning with a quantum neural network. This quantum network is evaluated in a grid world environment with different numbers of quantiles, illustrating in detail how the number of quantiles influences learning. It is also compared to standard quantum Q learning in a Markov Decision Process (MDP) chain, which demonstrates that quantum QR distributional Q learning can explore the environment more efficiently than standard quantum Q learning. Efficient exploration and balancing exploitation with exploration are major challenges in RL. Previous work has shown that more informative actions can be taken from a distributional perspective. Our findings suggest another cause for its success: the enhanced performance of distributional RL can be partially attributed to its superior ability to explore the environment efficiently.
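
To make the setup concrete, the sketch below (PennyLane-style Python) shows one way a parameterized quantum circuit could be read out as per-action quantile estimates. It is an illustration under assumed choices only, not the implementation evaluated in this paper: the ansatz, the one-qubit-per-(action, quantile) readout, and the names n_actions, n_quantiles, and quantile_circuit are assumptions made for this example.

import pennylane as qml
import numpy as np

n_actions, n_quantiles = 2, 3       # illustrative sizes (assumed, not from the paper)
n_qubits = n_actions * n_quantiles  # one measured qubit per (action, quantile) pair
n_layers = 2

dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantile_circuit(state_features, weights):
    # Encode the classical state into single-qubit rotation angles.
    qml.AngleEmbedding(state_features, wires=range(n_qubits))
    # Trainable entangling layers play the role of the "quantum neural network".
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    # One Pauli-Z expectation value per (action, quantile) estimate.
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weights = np.random.uniform(0, 2 * np.pi,
                            size=qml.StronglyEntanglingLayers.shape(n_layers, n_qubits))
state_features = np.random.uniform(0, np.pi, size=n_qubits)  # dummy encoded state

# Reshape the readout into an (n_actions, n_quantiles) table of quantile estimates.
theta = np.array(quantile_circuit(state_features, weights)).reshape(n_actions, n_quantiles)
q_values = theta.mean(axis=1)        # mean of an action's quantiles recovers its Q value
greedy_action = int(np.argmax(q_values))

Averaging over a row of quantiles recovers the ordinary Q value for that action, which is how a greedy action can still be selected from the learned return distribution.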

Highlights

  • Machine learning is the practice of teaching computer models how to learn from data

  • Quantum quantile regression (QR) distributional Q learning is compared to standard quantum Q learning in a Markov Decision Process (MDP) chain, demonstrating that it can explore the environment more efficiently

  • The quantum QR distributional Q learning algorithm performs better with 3 or 6 quantiles than with the other quantile counts tested


Summary

Introduction

Machine learning is the practice of teaching computer models how to learn from data. Recent research [4] [5] [6] shows that using the full distribution over random returns can preserve multimodality in the returns and make learning more effective, as demonstrated by state-of-the-art performance on a number of RL benchmarks. Using the full distribution instead of the single scalar of the expected value is more informative when deciding which action to take in RL. Related to this idea is the introduction of replay memories in deep RL, which allows the agent to leverage previous experiences to break the correlation of training data by using a full batch of transitions instead of a single step for training.
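
As a point of reference for the ideas above, the following minimal Python sketch shows the quantile Huber loss used in quantile regression Q learning together with a simple uniform replay buffer. It is a generic classical sketch, not this paper's quantum implementation; the constants N_QUANTILES, GAMMA, and KAPPA and the helpers quantile_huber_loss, ReplayBuffer, and td_target are illustrative names chosen here.

import random
from collections import deque
import numpy as np

N_QUANTILES = 3    # number of quantiles (assumed for illustration)
GAMMA = 0.99       # discount factor
KAPPA = 1.0        # Huber threshold
TAU_HAT = (np.arange(N_QUANTILES) + 0.5) / N_QUANTILES  # quantile midpoints

def quantile_huber_loss(theta, target):
    """theta, target: arrays of shape (N_QUANTILES,) holding quantile estimates."""
    # Pairwise TD errors u[i, j] = target_j - theta_i.
    u = target[None, :] - theta[:, None]
    huber = np.where(np.abs(u) <= KAPPA,
                     0.5 * u ** 2,
                     KAPPA * (np.abs(u) - 0.5 * KAPPA))
    # Asymmetric quantile weighting |tau_i - 1{u < 0}|.
    weight = np.abs(TAU_HAT[:, None] - (u < 0).astype(float))
    return (weight * huber).mean(axis=1).sum()

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample decorrelated mini-batches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def td_target(reward, done, next_quantiles):
    """Distributional Bellman target: r + gamma * quantiles of the greedy next action.
    next_quantiles has shape (n_actions, N_QUANTILES)."""
    if done:
        return np.full(N_QUANTILES, reward)
    greedy = np.argmax(next_quantiles.mean(axis=1))
    return reward + GAMMA * next_quantiles[greedy]

Sampling mini-batches from the buffer rather than consecutive steps is what breaks the temporal correlation of the training data mentioned above.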
