Deep Learning Approaches to Predict Future Frames in Videos
Deep neural networks are becoming central in several areas of computer vision. While there have been many studies on the classification of images and videos, future frame prediction is still rarely investigated, even though several applications could make good use of knowledge about the next frame of an image sequence in pixel space. Examples include video compression and autonomous agents in robotics that have to act in natural environments. Learning to forecast the future of an image sequence requires the system to understand and efficiently encode the content and dynamics over a certain period. It is viewed as a promising avenue from which even supervised tasks could benefit, since labeled video data is limited and hard to obtain. This work therefore gives an overview of scientific advances in future frame prediction and proposes a recurrent network model that utilizes recent techniques from deep learning research. The presented architecture is based on a recurrent encoder-decoder framework with convolutional cells, which preserves spatio-temporal data correlations. Driven by perceptually motivated objective functions and a modern recurrent learning strategy, it outperforms existing approaches to future frame generation on several types of video content, while requiring fewer training iterations and model parameters.
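As an illustration of the convolutional recurrent cells this abstract refers to (not the thesis's actual architecture), the minimal PyTorch sketch below shows a single convolutional LSTM cell whose hidden and cell states keep their spatial layout; stacking such cells into encoder and decoder halves, with the encoder's final state seeding the decoder, gives the encoder-decoder layout described above. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A single convolutional LSTM cell; states keep their spatial layout."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# Encode 10 observed frames; the final (h, c) would seed a decoder cell that
# rolls out future frames.
cell = ConvLSTMCell(in_channels=1, hidden_channels=16)
h = c = torch.zeros(2, 16, 64, 64)
for _ in range(10):
    frame = torch.rand(2, 1, 64, 64)
    h, (h, c) = cell(frame, (h, c))
```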
- Research Article
5
- 10.1002/rob.22135
- Dec 19, 2022
- Journal of Field Robotics
High latency in teleoperation has a significant negative impact on operator performance. While deep learning has revolutionized many domains recently, it has not previously been applied to teleoperation enhancement. We propose a novel approach to predict video frames deep into the future using neural networks informed by synthetically generated optical flow information. This can be employed in teleoperated robotic systems that rely on video feeds for operator situational awareness. We have used the image‐to‐image translation technique as a basis for the prediction of future frames. The Pix2Pix conditional generative adversarial network (cGAN) has been selected as a base network. Optical flow components reflecting real‐time control inputs are added to the standard RGB channels of the input image. We have experimented with three data sets of 20,000 input images each that were generated using our custom‐designed teleoperation simulator with a 500‐ms delay added between the input and target frames. Structural Similarity Index Measures (SSIMs) of 0.60 and Multi‐SSIMs of 0.68 were achieved when training the cGAN with three‐channel RGB image data. With the five‐channel input data (incorporating optical flow) these values improved to 0.67 and 0.74, respectively. Applying Fleiss' κ gave a score of 0.40 for three‐channel RGB data, and 0.55 for five‐channel optical flow‐added data. We are confident the predicted synthetic frames are of sufficient quality and reliability to be presented to teleoperators as a video feed that will enhance teleoperation. To the best of our knowledge, we are the first to attempt to reduce the impacts of latency through future frame prediction using deep neural networks.
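The central input change in this paper is concatenating optical-flow channels derived from control inputs with the RGB frame before it enters a Pix2Pix-style generator. Below is a minimal sketch of that five-channel conditioning tensor; the 256x256 resolution and the 64-filter first layer follow the standard Pix2Pix design and are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Delayed RGB frame plus a two-channel (dx, dy) flow map reflecting control inputs.
rgb  = torch.rand(1, 3, 256, 256)
flow = torch.rand(1, 2, 256, 256)

# Five-channel conditioning tensor for the cGAN generator.
x = torch.cat([rgb, flow], dim=1)

# Only the first encoder layer of a Pix2Pix-style U-Net needs to accept 5 channels.
first_conv = nn.Conv2d(in_channels=5, out_channels=64,
                       kernel_size=4, stride=2, padding=1)
print(first_conv(x).shape)  # torch.Size([1, 64, 128, 128])
```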
- Book Chapter
10
- 10.1007/978-3-030-78191-0_41
- Jan 1, 2021
Predicting future frames for robotic surgical video is an interesting, important, yet extremely challenging problem, given that the operative tasks may have complex dynamics. Existing approaches to future prediction of natural videos are based on either deterministic or stochastic models, including deep recurrent neural networks, optical flow, and latent space modeling. However, the potential for predicting meaningful movements of dual-arm robots in surgical scenarios has not been tapped so far, which is typically more challenging than forecasting the independent motions of one-arm robots in natural scenarios. In this paper, we propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences. Besides the content distribution, our model learns a motion distribution, which is novel in handling the small movements of surgical tools. Furthermore, we add invariant prior information from the gesture class into the generation process to constrain the latent space of our model. To the best of our knowledge, this is the first time that the future frames of dual-arm robots are predicted while considering their unique characteristics relative to general robotic videos. Experiments demonstrate that our model achieves more stable and realistic future frame predictions on the suturing task of the public JIGSAWS dataset. Keywords: Video prediction for medical robotics; Deep learning for visual perception; Medical robots and systems
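Purely as an illustrative sketch of the idea of combining a content latent, a motion latent, and an invariant gesture-class prior, the module below draws the two latents with the usual reparameterization trick and appends a class embedding; every dimension and name here is hypothetical, not taken from the TPG-VAE paper.

```python
import torch
import torch.nn as nn

class TernaryPriorLatent(nn.Module):
    """Hypothetical sketch: content + motion latents plus a gesture-class prior."""
    def __init__(self, feat_dim=128, z_dim=32, n_gestures=10):
        super().__init__()
        self.content_head = nn.Linear(feat_dim, 2 * z_dim)   # predicts (mu, logvar)
        self.motion_head  = nn.Linear(feat_dim, 2 * z_dim)
        self.gesture_emb  = nn.Embedding(n_gestures, z_dim)   # invariant class prior

    @staticmethod
    def reparameterize(stats):
        mu, logvar = torch.chunk(stats, 2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, content_feat, motion_feat, gesture_id):
        z_content = self.reparameterize(self.content_head(content_feat))
        z_motion  = self.reparameterize(self.motion_head(motion_feat))
        z_class   = self.gesture_emb(gesture_id)
        return torch.cat([z_content, z_motion, z_class], dim=-1)  # fed to the decoder

z = TernaryPriorLatent()(torch.rand(4, 128), torch.rand(4, 128),
                         torch.randint(0, 10, (4,)))
```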
- Research Article
25
- 10.1109/access.2021.3100678
- Jan 1, 2021
- IEEE Access
Anomaly detection in videos is the task of identifying frames in a video sequence that depict events that do not conform to expected behavior; it is extremely challenging due to the ambiguous and unbounded nature of anomalies. With the development of deep learning, video anomaly detection methods based on deep neural networks have made great progress. Existing methods mainly follow two routes, namely frame reconstruction and frame prediction. Because of the powerful generalization ability of neural networks, reconstruction-based methods often reconstruct abnormal frames nearly as well as normal ones, which limits their applicability. Recently, anomaly detection methods based on prediction have achieved advanced performance. However, their performance suffers when they cannot guarantee lower prediction errors for normal events. In this paper, we propose a novel future frame prediction model based on a bidirectional retrospective generative adversarial network (BR-GAN) for anomaly detection. To predict a higher-quality future frame for normal events, we first propose a bidirectional prediction combined with a retrospective prediction method to fully mine the bidirectional temporal information between the predicted frame and the input frame sequence. Then, the intensity and gradient loss between the predicted frame and the actual frame, together with an adversarial loss, are used as appearance (spatial) constraints. In addition, we propose a sequence discriminator composed of a 3-dimensional (3D) convolutional neural network to capture the long-term temporal relationships between frame sequences composed of predicted frames and input frames; this network plays a crucial role in maintaining the motion (temporal) consistency of the predicted frames for normal events. Such appearance and motion constraints further facilitate future frame prediction for normal events, so the prediction network becomes highly capable of distinguishing normal and abnormal patterns. Extensive experiments on benchmark datasets demonstrate that our method outperforms most existing state-of-the-art methods, validating its effectiveness for anomaly detection.
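The intensity and gradient losses mentioned as appearance constraints are standard in prediction-based anomaly detection; the sketch below assumes the common L2-intensity / L1-gradient formulation rather than the paper's exact weighting.

```python
import torch

def intensity_loss(pred, target):
    """Mean squared difference of pixel intensities."""
    return torch.mean((pred - target) ** 2)

def gradient_loss(pred, target):
    """L1 difference of spatial gradients; penalizes blurry predictions."""
    def grads(x):
        return (torch.abs(x[..., :, 1:] - x[..., :, :-1]),
                torch.abs(x[..., 1:, :] - x[..., :-1, :]))
    pdx, pdy = grads(pred)
    tdx, tdy = grads(target)
    return torch.mean(torch.abs(pdx - tdx)) + torch.mean(torch.abs(pdy - tdy))

pred, target = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
appearance_loss = intensity_loss(pred, target) + gradient_loss(pred, target)
```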
- Research Article
224
- 10.1016/j.patrec.2019.11.024
- Nov 16, 2019
- Pattern Recognition Letters
Integrating prediction and reconstruction for anomaly detection
- Conference Article
13
- 10.1109/ccwc51732.2021.9375909
- Jan 27, 2021
With the development of deep learning technology, a large number of new techniques for video anomaly detection have emerged. This paper proposes a video anomaly detection algorithm based on future frame prediction using a Generative Adversarial Network (GAN) and an attention mechanism. For the generation model, a U-Net is modified by adding an attention module. For the discrimination model, a Markov GAN discriminator with a self-attention mechanism is proposed, which influences the generator and improves the generation quality of the future video frame. Experiments show that the new video anomaly detection algorithm improves detection performance and that the attention module plays an important role in overall detection performance. It is found that the more attention modules are applied, and the deeper the level at which they are applied, the better the detection effect, which also verifies the rationality of the model structure used in this project.
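The abstract does not spell out the self-attention module; one plausible instantiation (an assumption, not the paper's design) is the SAGAN-style 2D self-attention block below, which could be dropped into either the U-Net generator or the Markov discriminator.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over the spatial positions of a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key   = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned weight of the residual path

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c/8)
        k = self.key(x).flatten(2)                     # (b, c/8, hw)
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw)
        v = self.value(x).flatten(2)                   # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x

features = SelfAttention2d(64)(torch.rand(1, 64, 32, 32))
```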
- Research Article
1
- 10.1049/el.2019.2376
- Sep 1, 2019
- Electronics Letters
GenSyth: a new way to understand deep learning
- Conference Article
2
- 10.1109/icip42928.2021.9506508
- Sep 19, 2021
In this paper, we focus on the problem of video prediction, i.e., future frame prediction. Most state-of-the-art techniques focus on synthesizing a single future frame at each step. However, this means the model must consume its own predicted frames when synthesizing multi-step predictions, resulting in gradual performance degradation due to accumulating pixel errors. To alleviate this issue, we propose a model that can handle multi-step prediction. Additionally, we employ techniques from view synthesis for future frame prediction, two problems that are treated independently in the literature. Our proposed method employs multiview camera pose prediction and depth-prediction networks to project the last available frame to the desired future frames via a differentiable point cloud renderer. For the synthesis of moving objects, we utilize an additional refinement stage. In experiments, we show that the proposed framework outperforms state-of-the-art methods on both the KITTI and Cityscapes datasets.
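As a simplified stand-in for the paper's differentiable point-cloud renderer, the sketch below inverse-warps the last available frame using a predicted depth map and a relative camera pose; names, shapes, and intrinsics are assumptions, and a real pipeline would also need occlusion handling plus the refinement stage for moving objects.

```python
import torch
import torch.nn.functional as F

def inverse_warp(src_frame, tgt_depth, K, T_tgt_to_src):
    """Sample the source frame at pixel locations implied by the target view's
    depth and the relative pose (a plain inverse-warping approximation)."""
    b, _, h, w = src_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)      # (3, h*w)
    cam = (torch.linalg.inv(K) @ pix) * tgt_depth.reshape(b, 1, -1)      # back-project
    cam = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)               # homogeneous
    proj = K @ (T_tgt_to_src @ cam)[:, :3]                               # re-project
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,                      # x in [-1, 1]
                        uv[:, 1] / (h - 1) * 2 - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(src_frame, grid, align_corners=True)

K = torch.tensor([[200., 0., 208.], [0., 200., 64.], [0., 0., 1.]])      # assumed intrinsics
future = inverse_warp(torch.rand(1, 3, 128, 416),
                      torch.ones(1, 1, 128, 416) * 10.0, K, torch.eye(4))
```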
- Conference Article
342
- 10.1109/iccv.2017.194
- Oct 1, 2017
Future frame prediction in videos is a promising avenue for unsupervised video representation learning. Video frames are naturally generated by the inherent pixel flows from preceding frames based on the appearance and motion dynamics in the video. However, existing methods focus on directly hallucinating pixel values, resulting in blurry predictions. In this paper, we develop a dual motion Generative Adversarial Net (GAN) architecture, which learns to explicitly enforce future-frame predictions to be consistent with the pixel-wise flows in the video through a dual-learning mechanism. The primal future-frame prediction and dual future-flow prediction form a closed loop, generating informative feedback signals to each other for better video prediction. To make both synthesized future frames and flows indistinguishable from reality, a dual adversarial training method is proposed to ensure that the future-flow prediction is able to help infer realistic future frames, while the future-frame prediction in turn leads to realistic optical flows. Our dual motion GAN also handles natural motion uncertainty in different pixel locations with a new probabilistic motion encoder, which is based on variational autoencoders. Extensive experiments demonstrate that the proposed dual motion GAN significantly outperforms state-of-the-art approaches on synthesizing new video frames and predicting future flows. Our model generalizes well across diverse visual scenes and shows superiority in unsupervised video representation learning.
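The closed loop between frame and flow prediction rests on warping: the last observed frame displaced by the predicted flow should agree with the directly predicted future frame. Below is a minimal sketch of that consistency term (the paper's probabilistic motion encoder and exact loss weighting are not reproduced here).

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """Warp a frame by a dense pixel-displacement flow field."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    coords = torch.stack([xs, ys]).unsqueeze(0) + flow        # displaced coordinates
    grid = torch.stack([coords[:, 0] / (w - 1) * 2 - 1,
                        coords[:, 1] / (h - 1) * 2 - 1], dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)

def flow_consistency_loss(pred_frame, last_frame, pred_flow):
    """Penalize disagreement between the predicted frame and the flow-warped last frame."""
    return torch.mean(torch.abs(pred_frame - warp_with_flow(last_frame, pred_flow)))

loss = flow_consistency_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                             torch.zeros(1, 2, 64, 64))
```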
- Research Article
2
- 10.21271/zjpas.34.2.3
- Apr 12, 2022
- ZANCO JOURNAL OF PURE AND APPLIED SCIENCES
Comprehensive Study for Breast Cancer Using Deep Learning and Traditional Machine Learning
- Research Article
5
- 10.1016/j.asoc.2023.110028
- Jan 18, 2023
- Applied Soft Computing
Future video frame prediction based on generative motion-assistant discriminative network
- Conference Article
325
- 10.1109/icip.2017.8297014
- Sep 1, 2017
Research on deep neural networks (DNNs) and deep learning has made great progress on 1D (speech), 2D (image), and 3D (3D-object) recognition/classification problems. Because a hyperspectral image (HSI), with its 2D spatial and 1D spectral information, is quite different from a 3D object image, existing DNNs cannot be directly extended to HSI classification. A multiscale 3D deep convolutional neural network (M3D-DCNN) is proposed for HSI classification, which jointly learns 2D multi-scale spatial features and 1D spectral features from HSI data in an end-to-end approach, promising better results with large-scale datasets. Without any hand-crafted features or pre-/post-processing such as PCA or sparse coding, we achieve state-of-the-art results on the standard datasets, which shows the technical validity and advancement of our method.
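Below is a minimal sketch of a multiscale 3D convolutional block of the kind the abstract describes, with parallel spatial kernel sizes applied to a (bands, height, width) cube; the channel counts, kernel sizes, and patch size are illustrative assumptions rather than the M3D-DCNN configuration.

```python
import torch
import torch.nn as nn

class MultiScale3DBlock(nn.Module):
    """Parallel 3D convolutions over a hyperspectral cube; outputs are concatenated."""
    def __init__(self, in_ch=1, out_ch=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=(7, k, k),
                      padding=(3, k // 2, k // 2))
            for k in (1, 3, 5)   # multi-scale spatial receptive fields, shared spectral extent
        ])

    def forward(self, x):                     # x: (batch, 1, bands, height, width)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

cube = torch.rand(2, 1, 103, 9, 9)            # e.g. a 9x9 patch with 103 spectral bands
features = MultiScale3DBlock()(cube)          # (2, 48, 103, 9, 9)
```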
- Conference Article
74
- 10.1145/3474085.3475693
- Oct 17, 2021
Detecting abnormal activities in real-world surveillance videos is an important yet challenging task, as prior knowledge about video anomalies is usually limited or unavailable. Although many approaches have been developed to address this problem, few of them capture normal spatio-temporal patterns effectively and efficiently. Moreover, existing works seldom explicitly consider the local consistency at frame level and the global coherence of temporal dynamics in video sequences. To this end, we propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection. Specifically, we first present a convolutional transformer to perform future frame prediction. It contains three key components, i.e., a convolutional encoder to capture the spatial information of the input video clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features and predict the future frame. Next, a dual discriminator based adversarial training procedure, which jointly considers an image discriminator that maintains local consistency at the frame level and a video discriminator that enforces global coherence of temporal dynamics, is employed to enhance the future frame prediction. Finally, the prediction error is used to identify abnormal video frames. Thorough empirical studies on three public video anomaly detection datasets, i.e., UCSD Ped2, CUHK Avenue, and ShanghaiTech Campus, demonstrate the effectiveness of the proposed adversarial spatio-temporal modeling framework.
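The last step, turning prediction error into a per-frame anomaly score, is commonly done by min-max normalizing the PSNR between predicted and actual frames over a video; the sketch below assumes that convention, and the paper's exact scoring may differ.

```python
import torch

def psnr(pred, target, max_val=1.0):
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse.clamp(min=1e-10))

def anomaly_scores(psnr_values):
    """Min-max normalize per-frame PSNR over a video; low scores flag likely anomalies."""
    p = torch.stack(psnr_values)
    return (p - p.min()) / (p.max() - p.min() + 1e-8)

scores = anomaly_scores([psnr(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
                         for _ in range(8)])
```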
- Book Chapter
- 10.1007/978-3-031-26293-7_34
- Jan 1, 2023
Motions in videos are often governed by physical and biological laws such as gravity, collisions, flocking, etc. Accounting for such natural properties is an appealing way to improve realism in future frame video prediction. Nevertheless, the definition and computation of intricate physical and biological properties in motion videos are challenging. In this work, we introduce PhyLoNet, a PhyDNet extension that learns long-term future frame prediction and manipulation. Similar to PhyDNet, our network consists of a two-branch deep architecture that explicitly disentangles physical dynamics from complementary information. It uses a recurrent physical cell (PhyCell) for performing physically-constrained prediction in latent space. In contrast to PhyDNet, PhyLoNet introduces a modified encoder-decoder architecture together with a novel relative flow loss. This enables a longer-term future frame prediction from a small input sequence with higher accuracy and quality. We have carried out extensive experiments, showing the ability of PhyLoNet to outperform PhyDNet on various challenging natural motion datasets such as ball collisions, flocking, and pool games. Ablation studies highlight the importance of our new components. Finally, we show an application of PhyLoNet for video manipulation and editing by a novel class label modification architecture.
- Conference Article
2
- 10.1109/vcip47243.2019.8965824
- Dec 1, 2019
In the field of autonomous driving, training an agent to watch and think like human drivers is an efficient way to approach self-driving problems. Inspired by NVIDIA's frame-level command generation task [2] and findings on human memory capacity [13], we propose a future frame prediction method for vehicle-centric driving videos. An end-to-end deep learning architecture called the future frame prediction (FFPRE) network is proposed, which can generate a future frame following the input video sequence. In particular, we develop a general memory-preserving module to extract meaningful history information from the input data. This module consists of two parts, namely memory recall and memory refine. We train this module to generate the short-term spatiotemporal information of a given video batch, which is a concatenation of history appearance and temporal clues. These two history clues are then transformed into future representations by a long-term prediction module. Thus, the human driver's prediction process is mimicked in a completely modular manner. Given the FFPRE network's effective long-short spatiotemporal feature learning ability, the proposed network can construct an internal representation (content and dynamics) of vehicle-centric driving videos without tracking the trajectory of every pixel. Experimental results on the publicly released NVIDIA and DR(eye)VE datasets indicate that our proposed method is efficient.
- Research Article
20
- 10.1109/access.2020.2995187
- Jan 1, 2020
- IEEE Access
Meteorological imagery prediction is an important and challenging problem for weather forecasting. It can also be seen as a video frame prediction problem that estimates future frames based on observed meteorological imagery. Although it is a widely investigated problem, it is still far from solved. Current state-of-the-art deep learning based approaches mainly optimise a mean square error loss, resulting in blurry predictions. We address this problem by introducing a Meteorological Predictive Learning GAN model (in short, MPL-GAN) that utilises a conditional GAN along with a predictive learning module in order to handle the uncertainty in future frame prediction. Experiments on a real-world dataset demonstrate the superior performance of our proposed model. Our model is able to map the blurry predictions produced by traditional mean-square-error-based predictive learning methods back to their original data distributions, and hence it is able to improve and sharpen the prediction. In particular, our MPL-GAN achieves an average sharpness of 52.82, which is 14% better than the baseline method. Furthermore, our model correctly detects the meteorological movement patterns that traditional unconditional GANs fail to capture.
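The sharpness number quoted above is usually computed with the gradient-difference measure introduced for video prediction by Mathieu et al.; the sketch below assumes that definition (on 8-bit intensities), and the paper's exact formula may differ.

```python
import torch

def sharpness(pred, target, max_val=255.0):
    """Gradient-difference sharpness (higher is sharper)."""
    def grad_sum(x):
        dy = torch.abs(x[..., 1:, :-1] - x[..., :-1, :-1])
        dx = torch.abs(x[..., :-1, 1:] - x[..., :-1, :-1])
        return dx + dy
    err = torch.mean(torch.abs(grad_sum(target) - grad_sum(pred)))
    return 10.0 * torch.log10(max_val ** 2 / err.clamp(min=1e-10))

print(sharpness(torch.rand(1, 1, 64, 64) * 255, torch.rand(1, 1, 64, 64) * 255))
```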