Video frame interpolation (VFI) generates intermediate frames between two consecutive frames. Previous studies extract the necessary information from both frames with one of two main approaches: pixel-level synthesis or flow-based methods. Both have limitations when applied to high-resolution video. Pixel-level synthesis based on the transformer architecture incurs high computational complexity at 4K resolution. Among flow-based methods, forward warping can leave holes where no source pixel is mapped, while backward warping struggles to obtain an accurate backward flow. Training poses a further challenge: previous works often train the stages of a multi-stage architecture separately, which yields suboptimal results. To address these issues, we propose a Recurrent Flow Update (RFU) model trained end to end. We introduce a global flow update module that leverages global information to mitigate the weaknesses of forward flow and gradually correct errors. Ablation studies demonstrate the effectiveness of our method. Our approach achieves state-of-the-art performance not only on the XTest and DAVIS datasets, which have 4K resolution, but also on the SNU-FILM dataset, which features large motions at low resolution.
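
The hole problem of forward warping mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the RFU model or any part of it; it assumes simple nearest-neighbor splatting, and the function names `forward_warp_nearest` and `demo`, as well as the toy flow field, are hypothetical choices made only for illustration.

```python
import numpy as np

def forward_warp_nearest(frame, flow):
    """Splat each source pixel to its rounded target location.

    frame: (H, W) array. flow: (H, W, 2) array of (dx, dy) displacements.
    Returns the warped frame and a boolean mask marking holes, i.e. target
    pixels that received no source pixel.
    """
    h, w = frame.shape
    warped = np.zeros_like(frame)
    hit = np.zeros((h, w), dtype=bool)
    ys, xs = np.mgrid[0:h, 0:w]
    # Target coordinates, rounded to the nearest pixel (illustrative assumption).
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    # Discard pixels that move outside the image.
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    warped[ty[inside], tx[inside]] = frame[inside]
    hit[ty[inside], tx[inside]] = True
    return warped, ~hit

def demo():
    # Toy 4x4 frame: the right half moves one pixel to the right,
    # the left half stays still, so the column in between is left empty.
    frame = np.arange(1, 17, dtype=np.float32).reshape(4, 4)
    flow = np.zeros((4, 4, 2), dtype=np.float32)
    flow[:, 2:, 0] = 1.0
    return forward_warp_nearest(frame, flow)

if __name__ == "__main__":
    warped, holes = demo()
    print(warped)
    print(holes)
```

In this toy case the hole mask is set along the column that the moving region vacated. Backward warping avoids such holes by sampling a source location for every target pixel, but it needs the flow defined at the target frame, which is the quantity the abstract describes as hard to estimate accurately.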