Abstract

Video prediction has progressed rapidly, driven largely by advances in deep learning. Prediction architectures based on direct pixel generation tend to produce blurry forecasts, yet they remain preferred in many applications because they operate on raw frames alone and require no auxiliary inputs such as segmentation masks or optical flow maps, which can make assembling a suitable dataset difficult. In this work, we present a novel end-to-end video forecasting framework that predicts the dynamic relationships between pixels in time and space. A 3D CNN encoder estimates the dynamic motion, while the decoder reconstructs the next frame, with 3D CNN and ConvLSTM2D blocks added to the skip connections. This novel form of skip connection plays an important role in reducing blur in the prediction and preserving spatial and dynamic information, which increases the accuracy of the whole model. The KITTI and Cityscapes datasets are used for training, and Caltech is used for inference. The proposed framework achieves better quality (PSNR = 33.14, MSE = 0.00101, SSIM = 0.924) with a small number of parameters (2.3 M).
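To make the described architecture concrete, the following is a minimal Keras sketch of an encoder-decoder of this shape: a Conv3D encoder, skip connections routed through ConvLSTM2D layers, and a 2D decoder that reconstructs the next frame. All layer widths, kernel sizes, strides, and frame dimensions below are illustrative assumptions, not the paper's reported configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed input: T past frames of size H x W with C channels (hypothetical values).
T, H, W, C = 4, 128, 160, 3

inputs = layers.Input(shape=(T, H, W, C))

# 3D CNN encoder: Conv3D blocks capture spatio-temporal motion cues.
e1 = layers.Conv3D(32, (3, 3, 3), strides=(1, 2, 2), padding="same", activation="relu")(inputs)
e2 = layers.Conv3D(64, (3, 3, 3), strides=(1, 2, 2), padding="same", activation="relu")(e1)
e3 = layers.Conv3D(128, (3, 3, 3), strides=(1, 2, 2), padding="same", activation="relu")(e2)

# Skip connections through ConvLSTM2D: each collapses the time axis
# while preserving spatial structure and learned dynamics.
s1 = layers.ConvLSTM2D(32, (3, 3), padding="same")(e1)   # (H/2, W/2, 32)
s2 = layers.ConvLSTM2D(64, (3, 3), padding="same")(e2)   # (H/4, W/4, 64)

# Bottleneck: summarize the deepest spatio-temporal features.
b = layers.ConvLSTM2D(128, (3, 3), padding="same")(e3)   # (H/8, W/8, 128)

# 2D decoder: upsample and fuse the motion-aware skip features,
# then regress the next frame's pixels.
d2 = layers.Conv2DTranspose(64, (3, 3), strides=2, padding="same", activation="relu")(b)
d2 = layers.Concatenate()([d2, s2])
d1 = layers.Conv2DTranspose(32, (3, 3), strides=2, padding="same", activation="relu")(d2)
d1 = layers.Concatenate()([d1, s1])
out = layers.Conv2DTranspose(C, (3, 3), strides=2, padding="same", activation="sigmoid")(d1)

model = Model(inputs, out)
model.compile(optimizer="adam", loss="mse")  # MSE matches the reported metric
```

In this reading, the ConvLSTM2D layers reduce each encoder feature sequence to a single motion-aware 2D map, so the decoder fuses dynamics-preserving features rather than raw encoder activations; this is one plausible interpretation of how such skip connections could reduce blur in the predicted frame.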
