On Improving the Learning of Long-Term Historical Information for Tasks with Partial Observability

Xinwen Wang, Xin Li, Linjing Lai

doi:10.1109/dsc50466.2020.00042

Abstract

Reinforcement learning (RL) has been recognized as a powerful tool for many real-world tasks in decision making, data mining, and information retrieval. Many RL algorithms have been developed; however, tasks involving partially observable environments, e.g., POMDPs (Partially Observable Markov Decision Processes), remain very challenging. Recent attempts to address this issue memorize long-term historical information using deep neural networks. A common strategy is to leverage recurrent networks, e.g., Long Short-Term Memory (LSTM), to retain and encode historical information and thereby estimate the true state of the environment under partial observability. However, for problems with very long history dependence and irregularly sampled data, the conventional LSTM is ill-suited and difficult to train, owing to the well-known vanishing-gradient problem and its limited ability to capture long-term history. In this paper, we propose using Phased LSTM to solve POMDP tasks. Phased LSTM introduces an additional time gate that periodically updates the memory cell, helping the network to 1) retain long-term information and 2) propagate gradients better, which facilitates training reinforcement learning models with recurrent structure. To further adapt the model to reinforcement learning and boost performance, we also propose Self-Phased LSTM, which incorporates a dynamically generated periodic gate that adjusts automatically to a wider range of tasks, especially notoriously difficult ones with sparse rewards. Our experimental results verify the effectiveness of Phased LSTM and Self-Phased LSTM for POMDP tasks.
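The periodic time gate that lets Phased LSTM update its memory cell only part of the time can be sketched as follows. This is a minimal sketch of the gate from the original Phased LSTM formulation (Neil et al., 2016), not the authors' code: the function names, the plain-list per-unit representation, and the defaults `r_on = 0.05` and `alpha = 1e-3` are illustrative assumptions. The dynamic gate of Self-Phased LSTM is not shown, since the abstract does not say how it is generated.

```python
from typing import List


def phase(t: float, tau: float, s: float) -> float:
    """Position within the current oscillation period: phi_t = ((t - s) mod tau) / tau."""
    return ((t - s) % tau) / tau


def time_gate(t: float, tau: float, s: float, r_on: float, alpha: float = 1e-3) -> float:
    """Openness k_t of a Phased LSTM time gate for one unit.

    tau: period, s: phase shift, r_on: fraction of the period the gate is open,
    alpha: small leak used while the gate is closed (illustrative default).
    """
    phi = phase(t, tau, s)
    if phi < r_on / 2:        # first half of the open phase: ramp up from 0 to 1
        return 2 * phi / r_on
    if phi < r_on:            # second half of the open phase: ramp down from 1 to 0
        return 2 - 2 * phi / r_on
    return alpha * phi        # closed phase: nearly shut, a small leak keeps gradients flowing


def phased_cell_update(
    c_prev: List[float],
    c_proposal: List[float],
    t: float,
    taus: List[float],
    shifts: List[float],
    r_on: float = 0.05,
    alpha: float = 1e-3,
) -> List[float]:
    """Per unit: c_t = k_t * c_proposal + (1 - k_t) * c_prev.

    c_proposal is the cell state an ordinary LSTM step would produce
    (hypothetical input here); while a unit's gate is closed, its previous
    state is carried over almost unchanged.
    """
    ks = [time_gate(t, tau, s, r_on, alpha) for tau, s in zip(taus, shifts)]
    return [k * cp + (1 - k) * c for k, cp, c in zip(ks, c_proposal, c_prev)]


# Unit 0 is fully open at t = 1 and takes the new value 3.0;
# unit 1 (shifted by 5) is closed and stays close to its old value 1.0.
print(phased_cell_update([1.0, 1.0], [3.0, 3.0], t=1.0,
                         taus=[10.0, 10.0], shifts=[0.0, 5.0], r_on=0.2))
```

Because a closed unit copies its previous cell state almost verbatim, the gradient along that path is scaled by roughly 1 instead of being repeatedly multiplied by forget-gate values, which is the intuition behind the improved long-range gradient propagation described in the abstract.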