Video prediction has emerged as a critical task in computer vision. While conventional deep learning-based video prediction models excel in accuracy, they often impose significant computational overhead and tend to disregard crucial factors such as inference latency and memory consumption, which can render them unsuitable for real-time video prediction. This issue is further exacerbated by the need to deploy many of these applications on resource-limited embedded devices. In this study, we introduce a faster yet accurate video prediction (FAVP) model designed to reduce inference latency and memory footprint while maintaining predictive accuracy. Our model employs an encoder-convertor-decoder architecture with a novel multi-layer convertor. Each layer within the convertor uses a consistent kernel size to precisely capture features at a given scale, while kernel sizes vary between layers to capture both local and global features. This uniformity within layers and diversity between layers enhances the model's ability to grasp dynamic changes. Furthermore, we demonstrate that replacing conventional large-kernel convolutions with involutions significantly trims the model's parameter count and inference latency without compromising prediction accuracy. Through comprehensive experiments on resource-adequate x86 platforms with the KTH, TrafficBJ, Human3.6M, Moving MNIST, and KTH-Enhanced datasets, we showcase our model's strong prediction performance, particularly in terms of inference latency. Additionally, we evaluate the model on the resource-constrained NVIDIA Jetson Nano platform using the KITTI and Caltech Pedestrian datasets. The results underscore our model's superior inference speed and its ability to meet real-time video prediction requirements, even under stringent resource limitations.
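To make the convertor design concrete, the sketch below gives a minimal PyTorch rendering of an involution layer and a multi-layer convertor in which each layer uses a single kernel size while the kernel size varies from layer to layer. The class names (`Involution2d`, `Convertor`), the kernel-size schedule, the group/reduction ratios, and the residual connections are illustrative assumptions for exposition, not the paper's exact FAVP configuration.

```python
import torch
import torch.nn as nn


class Involution2d(nn.Module):
    """Involution (Li et al., CVPR 2021): spatially varying, channel-shared kernels.

    A lightweight kernel-generation branch predicts a K*K kernel per group at every
    spatial location; the kernel is shared across the channels inside each group.
    """

    def __init__(self, channels, kernel_size=7, groups=4, reduction=4):
        super().__init__()
        assert channels % groups == 0 and channels % reduction == 0
        self.kernel_size = kernel_size
        self.groups = groups
        # Bottleneck that generates the per-location kernels.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, groups * kernel_size * kernel_size, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.kernel_size
        # (B, G*K*K, H, W) -> (B, G, 1, K*K, H, W)
        kernel = self.span(self.reduce(x)).view(b, self.groups, 1, k * k, h, w)
        # Unfold K*K neighbourhoods: (B, C*K*K, H*W) -> (B, G, C//G, K*K, H, W)
        patches = self.unfold(x).view(b, self.groups, c // self.groups, k * k, h, w)
        # Weighted sum over each window, then merge the groups back into channels.
        return (kernel * patches).sum(dim=3).view(b, c, h, w)


class Convertor(nn.Module):
    """Multi-layer convertor: one kernel size per layer, varied across layers.

    The kernel-size schedule below (small -> large -> small) is an assumed example
    of mixing local and global receptive fields.
    """

    def __init__(self, channels, kernel_sizes=(3, 5, 7, 5, 3)):
        super().__init__()
        self.layers = nn.ModuleList(
            [Involution2d(channels, kernel_size=k) for k in kernel_sizes]
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection (assumed)
        return x


if __name__ == "__main__":
    # Toy check: encoder output for a stacked-frame input, 64 channels at 32x32.
    feats = torch.randn(2, 64, 32, 32)
    print(Convertor(64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```

One reason this substitution can cut parameters and latency at large kernel sizes, consistent with the abstract's claim, is that involution generates its spatially varying, channel-shared kernels on the fly from a small bottleneck instead of storing dense K×K weights for every input-output channel pair as a standard convolution does.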