Abstract

Video prediction is the challenging task of generating the future frames of a video given a sequence of previously observed frames. This task involves constructing an internal representation that accurately models frame evolution, including both content and dynamics. Video prediction is considered difficult due to the inherent compounding of errors in recursive pixel-level prediction. In this paper, we present a novel video prediction system that focuses on regions of interest (ROIs) rather than on entire frames and learns frame evolution at the transformation level rather than at the pixel level. We provide two strategies to generate high-quality ROIs that contain potential moving visual cues. Frame evolution is modeled with a transformation generator that produces transformers and masks simultaneously, which are then combined to generate the future frame in a transformation-guided masking procedure. Compared with recent approaches, our system generates more accurate predictions by modeling visual evolution at the transformation level rather than at the pixel level. Focusing on ROIs avoids a heavy computational burden and enables our system to generate high-quality long-term future frames without severely amplified signal loss. Moreover, our system can generate diverse plausible future frames, which is important in many real-world scenarios. Furthermore, we enable our system to perform video prediction conditioned on a single frame by revising the transformation generator to produce motion-centric transformers. We test our system on four datasets under different experimental settings and demonstrate its advantages over recent methods, both quantitatively and qualitatively.
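To make the transformation-guided masking procedure concrete, the sketch below shows one plausible form it could take in PyTorch: a previous ROI crop is warped by K candidate transformations, and K masks decide per pixel which warped copy contributes to the predicted ROI. The affine parameterization, the number of components K, the softmax mask normalization, and all function names here are illustrative assumptions, not the exact design of the system described in the abstract.

```python
# Minimal sketch of a transformation-guided masking step (assumed design, not the paper's code).
import torch
import torch.nn.functional as F

def predict_roi(prev_roi, thetas, mask_logits):
    """Blend K transformed copies of the previous ROI into one predicted ROI.

    prev_roi:    (B, C, H, W) last observed ROI crop
    thetas:      (B, K, 2, 3) affine transformation parameters, one per component
    mask_logits: (B, K, H, W) unnormalized masks; softmax over K gives a convex blend
    """
    B, C, H, W = prev_roi.shape
    K = thetas.shape[1]

    # Warp the previous ROI with each of the K candidate transformations.
    grids = F.affine_grid(thetas.reshape(B * K, 2, 3), (B * K, C, H, W), align_corners=False)
    warped = F.grid_sample(
        prev_roi.unsqueeze(1).expand(B, K, C, H, W).reshape(B * K, C, H, W),
        grids, align_corners=False,
    ).reshape(B, K, C, H, W)

    # Masks select, per pixel, which transformed copy contributes to the prediction.
    masks = torch.softmax(mask_logits, dim=1).unsqueeze(2)  # (B, K, 1, H, W)
    return (masks * warped).sum(dim=1)                       # (B, C, H, W)

# Example usage with random tensors standing in for a learned transformation generator.
roi = torch.rand(1, 3, 64, 64)
thetas = torch.eye(2, 3).repeat(1, 4, 1, 1) + 0.01 * torch.randn(1, 4, 2, 3)
mask_logits = torch.randn(1, 4, 64, 64)
next_roi = predict_roi(roi, thetas, mask_logits)
```

Because the prediction is assembled from transformed pixels of the observed ROI rather than synthesized directly, errors accumulate in the transformation parameters rather than in raw pixel values, which is consistent with the abstract's argument for transformation-level modeling.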
