Vision Transformers (ViTs) have shown success in many low-level computer vision tasks. However, existing ViT models are limited by their high computation and memory cost when generating high-resolution videos for tasks like video prediction. This paper presents a scalable video transformer for full-frame video prediction. Specifically, we design a backbone transformer block for our video transformer. This transformer block decouples the temporal and channel features to reduce the computation cost when processing large-scale spatiotemporal video features. We use transposed attention, which attends over the channel dimension instead of spatial windows, to further reduce the computation cost. We also design a Global Shifted Multi-Dconv Head Transposed Attention (GSMDTA) module for our transformer block. This module is built upon two key ideas. First, we design a depth shift module to better incorporate cross-channel and temporal information from video features. Second, we introduce a global query mechanism to capture global information and handle large motion in video prediction. This new transformer block enables our video transformer to predict a full frame from multiple past frames at a resolution of 1024 × 512 within 12 GB of VRAM. Experiments on various video prediction benchmarks demonstrate that our method, with only RGB input, outperforms state-of-the-art methods that require additional data such as segmentation maps and optical flow. Our method exceeds state-of-the-art RGB-only methods by a large margin (1.2 dB) in PSNR, and it is also faster than state-of-the-art video prediction transformers.
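To illustrate why transposed attention reduces cost, the sketch below computes a channel-by-channel (C × C) attention map rather than a spatial (HW × HW) one, so complexity scales with C² instead of (HW)². This is a minimal NumPy sketch of generic transposed attention; the function name, projection weights, and normalization details are illustrative assumptions, and the paper's GSMDTA additionally incorporates depth shift and global queries, which are not shown here.

```python
import numpy as np

def transposed_attention(x, wq, wk, wv, tau=1.0):
    """Sketch of transposed (channel-wise) attention.
    x: (N, C) flattened spatial tokens (N = H*W); wq/wk/wv: (C, C) projections.
    The attention map is (C, C), so cost is O(N*C^2) rather than O(N^2*C)."""
    q, k, v = x @ wq, x @ wk, x @ wv                      # each (N, C)
    # L2-normalize each channel over the spatial axis before the dot product
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)
    attn = qn.T @ kn / tau                                # (C, C) channel map
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)        # softmax over channels
    return v @ attn.T                                     # (N, C) output tokens

# Example: 16 spatial tokens with 8 channels
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = transposed_attention(x, wq, wk, wv)
print(out.shape)  # (16, 8): output keeps the token layout, attention was C x C
```

Because the map is C × C, doubling the spatial resolution grows the attention cost only linearly in the number of tokens, which is what makes 1024 × 512 prediction feasible on a single 12 GB GPU.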