With the continued development of deep learning, video frame prediction has become an active research topic in computer vision because of its wide range of applications in anomaly detection, robot decision-making, weather forecasting, and autonomous driving. Although current video frame prediction methods have made remarkable progress, most of them generate predicted frames directly by extracting latent spatial distribution patterns from the video data. They lack spatiotemporal information modeling, which leads to high latency, ambiguity, and unrealistic results. In this work, we propose an end-to-end video prediction network, the Generative Differential-Assisted Discriminative Network (GDDNet). It combines a difference generation method, which extracts short-term variations from the images, with attention mechanisms, which recall global contextual motion information. Furthermore, the differential attention mechanism (DAM) module guides the model to allocate attention resources more efficiently. Together, these strategies considerably improve the model’s ability to represent motion features in video frames. To further improve prediction quality, we introduce adversarial training to enhance the clarity and authenticity of the predicted frames. To ensure that the spatiotemporal distribution of predicted frames is consistent with that of real frames, we introduce a sequential frame discriminator. Experimental results on the KITTI, UCF-101, and Caltech Pedestrian datasets demonstrate the effectiveness of GDDNet in comparison with state-of-the-art models. Multi-frame prediction and ablation experiments show that the proposed model not only improves prediction quality but also provides a more flexible prediction framework.