High spatial-temporal resolution plays a vital role in the application of geoscience dynamic observance and prediction. However, thanks to the constraints of technology and budget, it is troublesome for one satellite detector to get high spatial-temporal resolution remote sensing images. Individuals have developed spatiotemporal image fusion technology to resolve this downside, and deep remote sensing images with spatiotemporal resolution have become a possible and efficient answer. Due to the fixed size of the receptive field of convolutional neural networks, the features extracted by convolution operations cannot capture long-range features, so the correlation of global features cannot be modeled in the deep learning process. We propose a spatiotemporal fusion model of remote sensing images to solve these problems based on a dual branch feedback mechanism and texture transformer. The model separates the network from the coarse-fine images with similar structures through the idea of double branches and reduces the dependence of images on time series. It principally merges the benefits of transformer and convolution network and employs feedback mechanism and texture transformer to extract additional spatial and temporal distinction features. The primary function of the transformer module is to learn global temporal correlations and fuse temporal features with spatial features. To completely extract additional elaborated features in several stages, we have a tendency to design a feedback mechanism module. This module chiefly refines the low-level representation through high-level info and obtains additional elaborated features when considering the temporal and spacial characteristics. We have a tendency to receive good results by comparison with four typical spatiotemporal fusion algorithms, proving our model’s superiority and robustness.