Abstract

Most previous recurrent neural networks for spatiotemporal prediction have difficulty learning long-term spatiotemporal correlations and capturing skip-frame correlations. The reason is that these networks update the memory state using only information from the previous time step and therefore tend to suffer from gradient propagation difficulties. We propose a new framework, KeyMemoryRNN, which makes two contributions. First, we propose the KeyTranslate Module to extract the most effective historical memory state, named the keyword state, and we propose KeyMemoryLSTM, which uses the keyword state to update the hidden state to capture skip-frame correlations. In particular, KeyMemoryLSTM has two training stages. In the second stage, KeyMemoryLSTM adaptively skips the update of some time-step nodes to build a shorter memory information flow, which alleviates the difficulty of gradient propagation and helps learn long-term spatiotemporal correlations. Second, both the KeyTranslate Module and KeyMemoryLSTM are flexible add-on modules, so we can apply them to most RNN-based prediction networks to build KeyMemoryRNN with different base networks. KeyMemoryRNN achieves state-of-the-art performance on three spatiotemporal prediction tasks, and we provide ablation studies and memory analysis to verify its effectiveness.
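
The sketch below is a minimal, illustrative rendering of the two ideas in the abstract and not the authors' implementation: selecting a keyword state from the bank of historical memory states, and adaptively skipping the update of a time step to shorten the memory information flow. The module and variable names (KeywordSelector, SketchKeyMemoryCell, skip_gate, memory_bank) are assumptions introduced for this sketch; the paper's actual cell is a convolutional recurrent unit with a separate two-stage training procedure.

# Illustrative sketch only; names and structure are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeywordSelector(nn.Module):
    """Pick the most relevant historical memory state via dot-product attention."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)

    def forward(self, h_t: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, hidden), memory_bank: (batch, T, hidden)
        q = self.query(h_t).unsqueeze(1)                    # (batch, 1, hidden)
        scores = torch.bmm(q, memory_bank.transpose(1, 2))  # (batch, 1, T)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights, memory_bank).squeeze(1)   # "keyword" state

class SketchKeyMemoryCell(nn.Module):
    """LSTM cell whose hidden state is refreshed with the keyword state and whose
    update can be skipped entirely when the skip gate is (nearly) closed."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.selector = KeywordSelector(hidden_size)
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.skip_gate = nn.Linear(input_size + hidden_size, 1)

    def forward(self, x_t, h_t, c_t, memory_bank, skip_threshold=0.5):
        # Fuse the previous hidden state with the selected keyword state.
        keyword = self.selector(h_t, memory_bank)
        h_in = torch.tanh(self.fuse(torch.cat([h_t, keyword], dim=-1)))
        keep = torch.sigmoid(self.skip_gate(torch.cat([x_t, h_in], dim=-1)))
        h_new, c_new = self.cell(x_t, (h_in, c_t))
        # Hard skip: copy the previous states through unchanged, which shortens
        # the effective memory flow (the stage-two behaviour described above).
        mask = (keep > skip_threshold).float()
        h_out = mask * h_new + (1 - mask) * h_t
        c_out = mask * c_new + (1 - mask) * c_t
        return h_out, c_out

In use, the cell would be unrolled over the input sequence while appending each hidden state to memory_bank, so later steps can attend back to any earlier "keyword" state rather than only the immediately preceding one.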

Highlights

  • In recent years, deep learning has achieved great success in the fields of computer vision and natural language processing, especially in supervised learning

  • Spatiotemporal prediction learning learns spatiotemporal features from unlabeled spatiotemporal sequence data in an unsupervised manner and uses them for subsequent tasks; it is similar to time series prediction [1], [2], except that the data used in spatiotemporal prediction also has spatial dimensions

  • The above describes the dilemma faced by spatiotemporal prediction networks; to resolve it, we examined how humans capture information and recall it, and found that keyword information plays an important role in humans' rapid capture of information and efficient memory


Summary

INTRODUCTION

Deep learning has achieved great success in the fields of computer vision and natural language processing, especially in supervised learning. Consider the clapping action denoted by the full video frames shown in Fig. 1(a). When this group of frames is fed to a deep RNN-based prediction network, the long-term dilemma mentioned earlier forces the recurrent unit to rely on short-term information to update the memory state; that is, the memory state attends only to short-term spatiotemporal changes (such as a small swing of the arm) but fails to model the long-term spatiotemporal correlation needed to capture long-term motion context (such as clapping). We introduce the overall structure of KeyMemoryRNN and the specific structures of the KeyTranslate Module and KeyMemoryLSTM in detail

KEY MEMORY RNN
BUILDING BLOCK
EXPERIMENTS
RADAR ECHO
Findings
CONCLUSION
