Abstract

Spatiotemporal sequence prediction is an important problem in deep learning. We study next-frame(s) video prediction using a deep-learning-based predictive coding framework that uses convolutional LSTM (convLSTM) modules. We introduce a novel rgcLSTM architecture that requires a significantly lower parameter budget than a comparable convLSTM. By using a single multifunction gate, our reduced-gate model achieves equal or better next-frame(s) prediction accuracy than the original convolutional LSTM while using a smaller parameter budget, thereby reducing training time and memory requirements. We tested our reduced-gate modules within a predictive coding architecture on the moving MNIST and KITTI datasets. We found that our reduced-gate model reduces the total number of training parameters by approximately 40% and elapsed training time by 25% in comparison with the standard convolutional LSTM model, while also improving prediction accuracy. This makes our model more attractive for hardware implementation, especially on small devices. We also explored a space of 20 different gated architectures to get insight into how our rgcLSTM fits into that space.
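To make the single multifunction gate concrete, here is a minimal PyTorch sketch of a reduced-gate convolutional LSTM cell. The class name `RgcLSTMCell`, the exact gate coupling (one sigmoid gate reused in the forget, update, and output roles), and the hyperparameters are our assumptions for illustration; the paper's cell may differ in details such as peephole connections.

```python
import torch
import torch.nn as nn

class RgcLSTMCell(nn.Module):
    """Illustrative reduced-gate convLSTM cell (names and gate coupling are assumptions).

    A standard convLSTM computes four convolutions (input, forget, and output
    gates plus the cell candidate); here a single sigmoid gate is reused for
    all three gating roles, so only two convolutions remain.
    """

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution for the shared gate, one for the cell candidate.
        self.gate_conv = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size, padding=padding)
        self.cand_conv = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        z = torch.cat([x, h], dim=1)           # stack input and hidden feature maps
        g = torch.sigmoid(self.gate_conv(z))   # single multifunction gate
        c_hat = torch.tanh(self.cand_conv(z))  # cell candidate
        c = g * c + g * c_hat                  # g acts as both forget and update gate
        h = g * torch.tanh(c)                  # g also acts as the output gate
        return h, c

# Example: one step on a 64x64 single-channel frame.
cell = RgcLSTMCell(in_channels=1, hidden_channels=16)
h = c = torch.zeros(1, 16, 64, 64)
frame = torch.randn(1, 1, 64, 64)
h, c = cell(frame, (h, c))
```

With two convolutions in place of the convLSTM's four, the recurrent weight count is roughly halved, which is consistent with the approximately 40% reduction in total training parameters reported above.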

Highlights

  • The brain acquires representations in part through learning mechanisms that are triggered by prediction errors when processing sensory input 1,2,3

  • Training time is reduced by about 40%–50% on the moving MNIST prediction task, together with smaller standard error (SE), mean squared error (MSE), and mean absolute error (MAE) values

  • We found that when these models were substituted into Lotter et al.’s PredNet on the Moving MNIST dataset, the best-performing models included our reduced-gate convolutional long short-term memory (rgcLSTM) and Lotter et al.’s convolutional LSTM (cLSTM), according to the MSE, MAE, and structural similarity index measure (SSIM) performance measures (a sketch of these metrics follows the highlights)
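The MSE, MAE, and SSIM scores cited in the highlights can be computed per predicted frame as in the following sketch. The function name `frame_scores` and the use of scikit-image's `structural_similarity` are our choices for illustration, not the paper's evaluation code.

```python
import numpy as np
from skimage.metrics import structural_similarity

def frame_scores(pred, target):
    """Per-frame prediction-error metrics for grayscale frames scaled to [0, 1]."""
    mse = np.mean((pred - target) ** 2)   # mean squared error
    mae = np.mean(np.abs(pred - target))  # mean absolute error
    ssim = structural_similarity(pred, target, data_range=1.0)
    return mse, mae, ssim

# Example: score one predicted 64x64 moving-MNIST-style frame.
rng = np.random.default_rng(0)
target = rng.random((64, 64))
pred = np.clip(target + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
print(frame_scores(pred, target))
```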

Introduction

The brain in part acquires representations by using learning mechanisms that are triggered by prediction errors when processing sensory input 1,2,3. The lower-level areas compare the predictions with ground-truth sensory input and calculate the prediction errors. These are in turn forwarded to the higher-level areas to update their predictive representations in light of the new input information. Recurrent neural networks (RNNs) process sequential data such as occurs in signal processing 9, weather feeds 10, time series 11, and videos 4,8. Spatiotemporal datasets such as video are sequential datasets whose elements are images. Plain RNNs do not have gating mechanisms and suffer from vanishing and/or exploding gradient problems. Although they can learn local sequential dependencies, these gradient issues prevent them from learning long-term dependencies.
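As a quick illustration of the vanishing-gradient claim, the short sketch below backpropagates through a plain tanh RNN and through an LSTM, then prints the gradient norm at the earliest timestep. The toy dimensions and the side-by-side comparison are illustrative assumptions, not an experiment from the paper.

```python
import torch
import torch.nn as nn

def first_step_grad_norm(cell_type, steps=100, size=32):
    """Gradient norm w.r.t. the first input after backprop through `steps` steps."""
    torch.manual_seed(0)
    rnn = cell_type(size, size)
    xs = [torch.randn(1, size, requires_grad=True) for _ in range(steps)]
    state = (torch.zeros(1, size), torch.zeros(1, size)) \
        if cell_type is nn.LSTMCell else torch.zeros(1, size)
    for x in xs:
        state = rnn(x, state)
    out = state[0] if isinstance(state, tuple) else state
    out.sum().backward()
    return xs[0].grad.norm().item()

print("plain RNN:", first_step_grad_norm(nn.RNNCell))   # typically decays toward 0
print("LSTM     :", first_step_grad_norm(nn.LSTMCell))  # gating preserves more signal
```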

