Abstract

Given the inherent stochasticity and uncertainty of video, predicting future frames is exceptionally challenging. In this work, we study video prediction by combining the interpretability of stochastic state space models with the representation learning of deep neural networks equipped with channel attention. Our model, built on a variational encoder, introduces a novel channel-attention block that enhances the ability to capture dynamic objects in video frames, together with a Luenberger-type observer that tracks the dynamic evolution of the latent features. These components make the features of moving objects more conspicuous and enable the decomposition of videos into static features and dynamics in an unsupervised manner. By establishing a stability theory for the nonlinear Luenberger-type observer, we show that the hidden states in the feature space become insensitive to their initial values, which improves the robustness of the overall model. Furthermore, a variational lower bound on the data log-likelihood is derived to obtain a tractable posterior prediction distribution based on the variational principle. Finally, experiments on the Rotation Numbers dataset and the KTH dataset demonstrate that the proposed model outperforms concurrent works.
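The abstract does not specify the architecture of the channel-attention block. As a generic illustration of the mechanism it names, the following is a minimal squeeze-and-excitation-style sketch in NumPy: per-channel statistics are pooled, passed through a small bottleneck, and turned into a sigmoid gate that rescales each channel, which is one common way to make the channels carrying moving objects more conspicuous. All shapes and weights here are illustrative assumptions, not the paper's design.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention sketch (illustrative, not the paper's block).

    x  : feature map of shape (C, H, W)
    w1 : bottleneck weights, shape (C // r, C)
    w2 : expansion weights, shape (C, C // r)
    """
    s = x.mean(axis=(1, 2))                  # squeeze: per-channel statistic, shape (C,)
    h = np.maximum(w1 @ s, 0.0)              # excitation: bottleneck + ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ h)))      # sigmoid gate in (0, 1), shape (C,)
    return x * a[:, None, None]              # rescale each channel by its gate

# Hypothetical sizes for demonstration only.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = channel_attention(x, w1, w2)
```

Since the gate lies in (0, 1), each output channel is an attenuated copy of its input channel; learned weights would push the gates toward 1 for dynamic channels and toward 0 for static ones.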
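The robustness claim — hidden states insensitive to their initial values — can be illustrated with the textbook linear Luenberger observer x̂_{k+1} = A x̂_k + L (y_k − C x̂_k), whose estimation error obeys e_{k+1} = (A − LC) e_k and decays whenever A − LC is stable. The paper's observer is nonlinear and operates in a learned latent space; this linear sketch with made-up matrices only demonstrates the convergence idea.

```python
import numpy as np

# Illustrative system matrices (not from the paper); L is chosen so that
# A - L C has spectral radius < 1, making the observer error contract.
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
L = np.array([[0.5],
              [0.3]])

x = np.array([1.0, -1.0])      # true latent state
xhat = np.array([5.0, 5.0])    # observer initialized far from the truth

for _ in range(50):
    y = C @ x                              # observed output of the true system
    xhat = A @ xhat + L @ (y - C @ xhat)   # Luenberger correction step
    x = A @ x                              # true state propagates

err = np.linalg.norm(x - xhat)             # error vanishes despite the bad init
```

Because the error dynamics depend only on A − LC, the same trajectory is recovered from any initial guess, which is the sense in which the hidden states become insensitive to initialization.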
