This paper proposes a novel spatiotemporal fusion model that generates images with both high spatial and high temporal resolution (HSHT) by learning from only one pair of prior images. The method establishes a correspondence between low-spatial-resolution, high-temporal-resolution (LSHT) data and high-spatial-resolution, low-temporal-resolution (HSLT) data through superresolution of the LSHT data, followed by fusion via high-pass modulation. Specifically, the method operates in two stages: in the first stage, the spatial resolutions of the LSHT data on the prior and prediction dates are improved simultaneously by means of sparse representation; in the second stage, the known HSLT image and the superresolved LSHT images are fused via high-pass modulation to generate the HSHT image on the prediction date. Notably, because its two-layer spatiotemporal fusion strategy bridges the large spatial-resolution gap between HSLT and LSHT data, the method forms a unified framework for blending remote sensing images that exhibit either kind of temporal reflectance change: phenology change (e.g., the seasonal change of vegetation) or land-cover-type change (e.g., the conversion of farmland to built-up area). The method was tested on one simulated data set and two actual data sets of Landsat Enhanced Thematic Mapper Plus (ETM+)–Moderate Resolution Imaging Spectroradiometer (MODIS) acquisitions, and it was compared with other well-known spatiotemporal fusion algorithms on two types of data: images dominated by phenology changes and images dominated by land-cover-type changes. Experimental results demonstrate that our method better captures surface reflectance changes on both types of images.
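To make the second stage concrete, the following is a minimal sketch of one common high-pass-modulation variant, written here as an illustration rather than the paper's exact formulation: the superresolved prediction-date image is modulated, pixel by pixel, by the ratio of the prior-date HSLT image to its superresolved LSHT counterpart, so that fine spatial detail is carried over from the prior date while the temporal change in reflectance comes from the coarse sensor. The function name and the small `eps` stabilizer are assumptions for this sketch.

```python
import numpy as np

def highpass_modulation_fuse(hslt_prior, sr_lsht_prior, sr_lsht_pred, eps=1e-6):
    """Stage-2 sketch: fuse the prior fine image with superresolved coarse images.

    hslt_prior    -- known HSLT image on the prior date
    sr_lsht_prior -- superresolved LSHT image on the prior date
    sr_lsht_pred  -- superresolved LSHT image on the prediction date
    Returns a predicted HSHT image on the prediction date.
    """
    # Modulation ratio: how much fine-scale detail the superresolution
    # step missed on the prior date (eps avoids division by zero).
    ratio = hslt_prior / (sr_lsht_prior + eps)
    # Inject that detail into the prediction-date image multiplicatively.
    return sr_lsht_pred * ratio
```

Note one sanity property of this variant: when the superresolved images on the two dates are identical (no temporal change at coarse scale), the prediction reduces to the prior HSLT image itself.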