Video super-resolution (SR) is more challenging than image super-resolution because of its demanding computation time. To enlarge a low-resolution video, the temporal relationship among frames must be fully exploited. Video SR can be modeled as a multi-frame SR problem, with deep learning methods used to estimate both spatial and temporal information. This paper proposes a lightweight residual network based on multi-stage back-projection for multi-frame SR. We improve the back-projection-based residual block by adding weights for adaptive feature tuning, and we add global and local connections to learn deeper feature representations. We jointly learn spatial-temporal feature maps using the proposed Spatial Convolution Packing scheme as an attention mechanism that extracts more information from both the spatial and temporal domains. Unlike other methods, our proposed network takes multiple low-resolution frames as input and produces multiple super-resolved frames simultaneously. Video SR quality can be further improved by self-ensemble enhancement to handle videos with different motions and distortions. Extensive experimental results show that our approaches yield large improvements over state-of-the-art video SR methods. Compared with recent CNN-based video SR works, our approaches save up to 60% of the computation time and achieve a 0.6 dB PSNR improvement.
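To make the weighted back-projection idea concrete, the following is a minimal, hypothetical PyTorch sketch of a residual block built around an up-projection/down-projection pair, where the back-projection error is scaled by a learnable weight (adaptive feature tuning) and fed back through a local skip connection. Since the abstract gives no implementation details, all module names, kernel settings, and the scalar-weight design here are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class WeightedBackProjectionBlock(nn.Module):
    """Hypothetical sketch of a back-projection residual block with a
    learnable residual weight, assuming a 4x projection pair."""

    def __init__(self, channels=64):
        super().__init__()
        kernel, stride, pad = 8, 4, 2  # common settings for 4x up/down projection
        self.up = nn.ConvTranspose2d(channels, channels, kernel, stride, pad)
        self.down = nn.Conv2d(channels, channels, kernel, stride, pad)
        self.act_up = nn.PReLU()
        self.act_down = nn.PReLU()
        # Learnable scalar weight for adaptive tuning of the residual feedback.
        self.res_weight = nn.Parameter(torch.ones(1))

    def forward(self, x):
        h = self.act_up(self.up(x))      # project LR features to HR space
        l = self.act_down(self.down(h))  # project back down to LR space
        residual = l - x                 # back-projection error
        return x + self.res_weight * residual  # weighted feedback + local skip

# Usage: refine 64-channel low-resolution feature maps.
feat = torch.randn(1, 64, 32, 32)
refined = WeightedBackProjectionBlock(channels=64)(feat)
```

Stacking several such blocks and adding a global skip connection from the input features to the final stage would correspond to the global and local connections described above.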