Video super-resolution aims at restoring the spatial resolution of the reference frame based on consecutive input low-resolution (LR) frames. Existing implicit alignment-based video super-resolution methods commonly utilize convolutional LSTM (ConvLSTM) to handle sequential input frames. However, vanilla ConvLSTM processes input features and hidden states independently in operations and has limited ability to handle the inter-frame temporal redundancy in low-resolution fields. In this paper, we propose a multi-stage spatio-temporal adaptive network (MS-STAN). A spatio-temporal adaptive ConvLSTM (STAC) module is proposed to handle input features in low-resolution fields. The proposed STAC module utilizes the correlation between input features and hidden states in the ConvLSTM unit and modulates the hidden states adaptively conditioned on fused spatio-temporal features. A residual stacked bidirectional (RSB) architecture is further proposed to fully exploit the processing ability of the STAC unit. The proposed STAC and RSB architecture promote the vanilla ConvLSTM’s ability to exploit the inter-frame correlations, thus improving the reconstruction quality. Furthermore, different from existing methods that only aggregate features from the temporal branch once at a specified stage of the network, the proposed network is organized in a multi-stage manner. The corresponding temporal correlation in features at different stages can be fully exploited. Experimental results on Vimeo-90K-T and UMD10 datasets show that the proposed method has comparable performance with current video super-resolution methods. The code is available at https://github.com/yhjoker/MS-STAN.
Read full abstract