Abstract

As a pixel-wise dense forecasting task, video prediction is challenging due to its high computational complexity, severe future uncertainty, and extremely complicated spatial–temporal patterns. Many deep learning methods have been proposed for the task and have brought significant improvements. However, they focus on modeling short-term spatial–temporal dynamics and fail to sufficiently exploit long-term ones. As a result, these methods tend to deliver unsatisfactory performance when long-term forecasts are required. In this article, we propose a novel unified memory network (UNIMEMnet) for long-term video prediction, which can effectively exploit long-term motion-appearance dynamics and unify short-term and long-term spatial–temporal dynamics in a single architecture. In the UNIMEMnet, a dual-branch multi-scale memory module is carefully designed to extract and preserve long-term spatial–temporal patterns. In addition, a short-term spatial–temporal dynamics module and an alignment and fusion module are devised to capture short-term motion-appearance dynamics and coordinate them with the long-term ones retrieved from the designed memory module. Extensive experiments on five video prediction datasets from both synthetic and real-world scenarios validate the effectiveness and superiority of the proposed UNIMEMnet over state-of-the-art methods.
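To make the module composition described above easier to picture, the following is a minimal, purely illustrative Python sketch of how the three named components might fit together in a forward pass. Every class, method, and variable name here is a hypothetical placeholder chosen for exposition; it is not the authors' implementation or API.

# Illustrative sketch only: the structure mirrors the abstract's description,
# not the actual UNIMEMnet code.

class ShortTermDynamicsModule:
    """Captures short-term spatial-temporal (motion-appearance) dynamics."""
    def forward(self, recent_frames):
        # Placeholder: encode the most recent frames into short-term features.
        return {"short_term_features": recent_frames}

class DualBranchMultiScaleMemory:
    """Extracts and preserves long-term spatial-temporal patterns."""
    def __init__(self):
        self.memory_bank = []  # stores long-term motion/appearance patterns
    def forward(self, frame_history):
        # Placeholder: update and query the multi-scale memory with the full history.
        self.memory_bank.append(frame_history)
        return {"long_term_features": list(self.memory_bank)}

class AlignmentAndFusionModule:
    """Aligns and fuses short-term dynamics with retrieved long-term memory."""
    def forward(self, short_term, long_term):
        # Placeholder: coordinate the two feature sets and decode future frames.
        return {"predicted_frames": (short_term, long_term)}

class UNIMEMnet:
    """Unified memory network: composes the three modules for long-term prediction."""
    def __init__(self):
        self.short_term = ShortTermDynamicsModule()
        self.memory = DualBranchMultiScaleMemory()
        self.fusion = AlignmentAndFusionModule()
    def predict(self, frame_history):
        st = self.short_term.forward(frame_history[-2:])  # short window of recent frames
        lt = self.memory.forward(frame_history)           # full observed history
        return self.fusion.forward(st, lt)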
