Abstract

Learning predictive models for unlabeled spatiotemporal data is challenging, in part because visual dynamics can be highly entangled, especially in real scenes. In this paper, we refer to the multi-modal output distribution of predictive learning as spatiotemporal modes. We identify an experimental phenomenon, named spatiotemporal mode collapse (STMC), in most existing video prediction models: features collapse into invalid representation subspaces due to an ambiguous understanding of mixed physical processes. We propose to quantify STMC and explore its solution, for the first time in the context of unsupervised predictive learning. To this end, we present ModeRNN, a decoupling-aggregation framework with a strong inductive bias toward discovering the compositional structure of spatiotemporal modes between recurrent states. We first leverage a set of dynamic slots with independent parameters to extract the individual building components of spatiotemporal modes. We then perform a weighted fusion of slot features to adaptively aggregate them into a unified hidden representation for the recurrent update. Through a series of experiments, we show a high correlation between STMC and the fuzzy prediction results of future video frames. Moreover, ModeRNN is shown to better mitigate STMC and achieve state-of-the-art results on five video prediction datasets.
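To make the decoupling-aggregation idea concrete, the sketch below shows one possible recurrent cell in which each slot has its own parameters for extracting a component of the mixed dynamics, and a learned softmax weighting fuses the slot features into a single hidden state before a gated recurrent update. This is a minimal, hypothetical illustration: the class and parameter names, layer sizes, and the GRU-style update are assumptions, not the paper's exact ModeRNN cell.

```python
import torch
import torch.nn as nn


class SlotDecoupleAggregateCell(nn.Module):
    """Hypothetical decoupling-aggregation recurrent cell (not the paper's exact design)."""

    def __init__(self, input_dim, hidden_dim, num_slots):
        super().__init__()
        # Decoupling: each slot has independent parameters and sees [input, hidden].
        self.slots = nn.ModuleList(
            nn.Linear(input_dim + hidden_dim, hidden_dim) for _ in range(num_slots)
        )
        # Scoring network used for the weighted fusion over slot features.
        self.score = nn.Linear(hidden_dim, 1)
        # Standard gated recurrent update applied to the fused representation.
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, x, h):
        # Run every slot on the concatenated input and previous hidden state.
        z = torch.cat([x, h], dim=-1)
        slot_feats = torch.stack([torch.tanh(s(z)) for s in self.slots], dim=1)  # (B, S, H)
        # Aggregation: softmax over slot scores gives adaptive fusion weights.
        weights = torch.softmax(self.score(slot_feats), dim=1)                   # (B, S, 1)
        fused = (weights * slot_feats).sum(dim=1)                                # (B, H)
        # Recurrent update on the unified hidden representation.
        return self.cell(fused, h)


# Usage on a toy sequence of flattened frame features: (time, batch, features).
cell = SlotDecoupleAggregateCell(input_dim=64, hidden_dim=128, num_slots=4)
h = torch.zeros(8, 128)
for x_t in torch.randn(10, 8, 64):
    h = cell(x_t, h)
```

The intent of this toy design is that the per-slot parameters encourage different slots to specialize in different spatiotemporal modes, while the adaptive weighting lets the cell recombine them per sample instead of collapsing all dynamics into one shared subspace.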
