Abstract

Spatiotemporal prediction is a challenging topic because of uncertainty. Existing works attempt to design complicated systems to learn short-term and long-term dynamics. However, these models suffer from a heavy computational burden on spatiotemporal data, especially for high-resolution frames. To reduce resource dependence, we propose FastNet, a novel and lightweight encoder–decoder for predictive learning. We stack four ConvLSTM (Convolutional Long Short-Term Memory) based layers to construct a hierarchical framework. On top of this architecture, the feature aggregation module alternately aligns the temporal context, decouples different frequency components, gathers multi-level features, and synthesizes new feature maps. Aggregating diverse hierarchical features into the predictions brings two benefits: rich multi-level features and low resource usage. Within the unit blocks, depth-wise separable convolutions are used to improve efficiency and compress the model size. In addition, we adopt a perceptual loss as the cost function between ground truths and predictions, which helps our model produce frames more similar to the true ones. In experiments, we evaluate FastNet on the Moving MNIST (Modified National Institute of Standards and Technology) and Radar Echo datasets to verify its effectiveness. The quantitative metrics on the Radar Echo dataset show that FastNet achieves slightly higher accuracy with up to an 84% reduction in computation compared with PredRNN-V2. Therefore, our FastNet achieves competitive results with lower resource usage and fewer parameters than the state-of-the-art model.
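The efficiency gain from depth-wise separable convolution can be illustrated by a simple parameter count: a standard convolution needs one k×k filter per (input, output) channel pair, while the separable version uses one k×k depth-wise filter per input channel followed by a 1×1 point-wise mixing step. The sketch below uses hypothetical layer sizes, not FastNet's actual configuration.

```python
# Parameter-count comparison: standard vs. depth-wise separable convolution.
# Layer sizes here are illustrative assumptions, not taken from the paper.

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    # One k x k filter per (input channel, output channel) pair.
    return k * k * c_in * c_out

def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    # Depth-wise stage: one k x k filter per input channel.
    depthwise = k * k * c_in
    # Point-wise stage: 1 x 1 convolution mixing channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

if __name__ == "__main__":
    c_in, c_out, k = 64, 64, 3
    std = standard_conv_params(c_in, c_out, k)   # 3*3*64*64 = 36864
    sep = separable_conv_params(c_in, c_out, k)  # 3*3*64 + 64*64 = 4672
    print(f"standard: {std}, separable: {sep}, "
          f"saving: {1 - sep / std:.1%}")
```

For this hypothetical 64-channel 3×3 layer, the separable variant needs roughly an eighth of the parameters, which is consistent with the kind of computation savings the abstract reports.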
