Abstract

The first part of this two-part series of papers provides a survey of recent advances in Deep Reinforcement Learning (DRL) applications for solving partially observable Markov decision process (POMDP) problems. Reinforcement Learning (RL) is an approach that simulates the human's natural learning process; its key idea is to let the agent learn by interacting with a stochastic environment. Because the agent needs only limited access to information about the environment, the approach can be applied in many fields that require self-learning. Although efficient algorithms are already in wide use, an organized investigation remains essential: it allows sound comparisons and helps in choosing the best structures or algorithms when applying DRL to various applications. In this overview, we introduce Markov Decision Process (MDP) problems and Reinforcement Learning, and we review applications of DRL for solving POMDP problems in games, robotics, and natural language processing. A follow-up paper will cover applications in transportation, communications and networking, and industry.

Highlights

  • We focus on applications, generally based on Deep Reinforcement Learning, to partially observable Markov decision process (POMDP) problems

  • Programmatically Interpretable Reinforcement Learning (PIRL) generates interpretable agent policies, and a new method called Neurally Directed Program Search (NDPS) finds a programmatic policy with maximal reward. In addition to the works above, several techniques are discussed by Shao et al [59]: Sharma et al [78] proposed Fine Grained Action Repetition (FiGAR) to improve Deep Deterministic Policy Gradient (DDPG); Gao et al [79] used The Open Racing Car Simulator (TORCS) to evaluate Normalized Actor-Critic (NAC); Mazumder et al

  • For more surveys in robotics, see [111] from 2009, a study of robot learning from demonstration (LfD), in which a policy is learned from demonstrations provided by a teacher; Deisenroth [112] surveyed policy search for robotics in 2013; in 2014, Kober and Peters [113] provided a general survey on RL in robotics; and in 2018, Tai et al [114] presented a comprehensive survey on learning control in robotics, from reinforcement to imitation

Summary

Markov Decision Processes

A Markov Decision Process is a general mathematical framework for representing an optimal sequence of decisions in an uncertain environment. Depending on the current state and the action taken, rewards are received as either positive gains or negative costs. Another characteristic of an MDP is the uncertainty in the successor state given the action taken. The expected utility of following policy π from state s is the state value function Vπ(s) of the policy, which is not random: Vπ(s) = E[Uπ(s)] = E[∑_t γ^t R(s_t)]. The state-action value function Qπ(s, a) of a policy, called the Q-value, is the expected utility of taking action a from state s and then following policy π: Qπ(s, a) = ∑_{s'} T(s, a, s')[R(s, a, s') + γ Vπ(s')], where T(s, a, s') is the transition probability. When s is not an end state, the state value equals the Q-value of the action the policy chooses, which yields the Bellman equation: Vπ(s) = Qπ(s, π(s)) = ∑_{s'} T(s, π(s), s')[R(s, π(s), s') + γ Vπ(s')].
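To make the recursion concrete, the sketch below applies the Bellman equation as an iterative policy-evaluation loop on a tiny tabular MDP. It is a minimal illustration only; the transition tensor T, reward tensor R, policy pi, and discount gamma are hypothetical placeholders, not values from the surveyed works.

    import numpy as np

    def evaluate_policy(T, R, pi, gamma=0.9, tol=1e-8):
        """Compute Vpi(s) by repeatedly applying the Bellman equation.

        T[s, a, s'] : probability of landing in s' after taking a in s
        R[s, a, s'] : reward collected on that transition
        pi[s]       : action chosen by a deterministic policy in state s
        """
        n_states = T.shape[0]
        V = np.zeros(n_states)
        while True:
            # Bellman backup: V(s) = sum_s' T(s, pi(s), s') * (R(s, pi(s), s') + gamma * V(s'))
            V_new = np.array([
                np.sum(T[s, pi[s]] * (R[s, pi[s]] + gamma * V))
                for s in range(n_states)
            ])
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    # Illustrative 2-state, 2-action MDP (the numbers are arbitrary).
    T = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    R = np.ones_like(T)      # reward of 1 on every transition
    pi = np.array([0, 1])    # fixed deterministic policy
    print(evaluate_policy(T, R, pi))

With a reward of 1 per step and gamma = 0.9, the loop converges to a value of about 10 for each state, which matches the geometric-series sum ∑_t γ^t implied by the value-function definition above.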

Partially Observable Markov Decision Processes
Reinforcement Learning
RL Algorithms
Deep Reinforcement Learning
Value-Based Algorithms
Policy-Based Algorithms
Actor-Critic Algorithms
Applications
Board Games
Card Games
Video Games
Robotics
Manipulation
Locomotion
Robotics Simulators
Natural Language Processing
Neural Machine Translation
Dialogue
Visual Dialogue
Summary
Final Thoughts