Over the past decade, spatiotemporal fusion has become an indispensable tool for monitoring land surface dynamics because it can produce surface reflectance products with both high spatial and high temporal resolution. However, existing fusion methods usually generate multispectral products by predicting each spectral band separately, so the useful information carried by spectral autocorrelation across bands is ignored and remains unexploited. To address this issue, we propose a novel spatiotemporal fusion method, the spatiotemporal Fusion Incorporating Spectral autocorrelaTion (FIRST) model, which fully utilizes the multiple spectral bands of surface reflectance products. Compared with other fusion methods, the model has three distinct advantages: (1) it utilizes spectral autocorrelation in a many-to-many regression framework that simultaneously inputs and predicts multispectral bands without the collinearity effect; (2) it maintains high fusion accuracy, with acceptable computational efficiency, even when the spatiotemporal variation is large; and (3) it produces robust results even when the input images are contaminated by haze and thin clouds. We tested the FIRST model at several experimental sites and compared it with four typical methods: the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM), the Flexible Spatiotemporal DAta Fusion (FSDAF) model, the regression model Fitting, spatial Filtering and residual Compensation (Fit-FC) model, and the enhanced STARFM (ESTARFM). The results demonstrate that FIRST yields better overall performance, which we attribute to its simple and effective technical principles. FIRST is thus expected to provide high-quality remotely sensed data with high spatial resolution and frequent observations for various applications.
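
To illustrate what a many-to-many regression over spectral bands can look like, the following is a minimal sketch and not the FIRST implementation. The abstract does not specify the regression used, so a ridge-regularized multi-output linear regression is substituted here as one common way to fit all output bands jointly from all input bands while staying stable under strongly correlated (collinear) bands. The function names `fit_many_to_many` and `predict`, the penalty `alpha`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_many_to_many(X, Y, alpha=1e-2):
    """Fit Y ~ X @ W + b for all output bands at once (ridge stand-in).

    X: (n_pixels, n_bands_in) input reflectances
    Y: (n_pixels, n_bands_out) target reflectances
    alpha: ridge penalty (hypothetical default, not taken from the paper)
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    n_in = X.shape[1]
    # The alpha * I term keeps the normal equations well conditioned
    # even when the input bands are highly correlated with each other.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_in), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def predict(X, W, b):
    """Predict every output band simultaneously."""
    return X @ W + b


# Synthetic demonstration with strongly correlated input bands.
rng = np.random.default_rng(0)
n_pixels, n_bands = 500, 6
base = rng.normal(size=(n_pixels, 1))
X = base + 0.05 * rng.normal(size=(n_pixels, n_bands))  # collinear bands
W_true = rng.normal(size=(n_bands, n_bands))
Y = X @ W_true + 0.01 * rng.normal(size=(n_pixels, n_bands))

W, b = fit_many_to_many(X, Y)
Y_hat = predict(X, W, b)
rmse = float(np.sqrt(np.mean((Y_hat - Y) ** 2)))
print(f"joint-fit RMSE over all bands: {rmse:.4f}")
```

In contrast to a band-by-band scheme, each predicted band here draws on every input band through one column of `W`, which is the sense in which the regression is many-to-many. A method that handles collinearity through latent components rather than a penalty would replace the `np.linalg.solve` step while keeping the same input and output shapes.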