Soil salinization is one of the primary causes of land degradation in arid areas and severely restricts the sustainable development of agriculture and the economy. Satellite remote sensing is essential for real-time, large-scale evaluation of soil salinity content (SSC). However, some satellite images have low temporal resolution and are affected by weather conditions, so images synchronized with ground observations are often unavailable. Conversely, some high-temporal-resolution satellite images have a spatial resolution that is too coarse relative to ground features. These spatiotemporal limitations may reduce the accuracy of SSC evaluation. This study focuses on the arable land in the Manas River Basin, located in the arid region of northwest China, to explore the potential of integrating spatiotemporal data fusion with deep learning algorithms for evaluating SSC. We used the flexible spatiotemporal data fusion (FSDAF) model to merge Landsat and MODIS images, obtaining fused images synchronized with the ground sampling times. Using support vector regression (SVR), random forest (RF), and convolutional neural network (CNN) models, we compared SSC evaluation results obtained from satellite images synchronized with the ground sampling times against those obtained from unsynchronized images. The results showed that the FSDAF fused image was highly similar to the original image in spectral reflectance, with a coefficient of determination (R²) exceeding 0.8 and a root mean square error (RMSE) below 0.029. The model thus effectively compensates for missing fine-resolution satellite images synchronized with the ground sampling times. The optimal salinity indices for evaluating the SSC of arable land in arid areas are S3, S5, SI, SI1, SI3, SI4, and Int1. These indices correlate highly with SSC for both synchronized and unsynchronized images.
SSC evaluation models based on satellite images synchronized with the ground sampling times were more accurate than those based on unsynchronized images, indicating that synchronization significantly affects SSC evaluation accuracy. Among the three models, the CNN achieved the highest predictive accuracy for both synchronized and unsynchronized images, indicating its significant potential for image-based prediction. The optimal evaluation scheme is the CNN model based on the satellite image synchronized with the ground sampling times, with an R² of 0.767 and an RMSE of 1.677 g·kg⁻¹. We therefore propose a framework that integrates spatiotemporal data fusion with a CNN for evaluating soil salinity, which improves evaluation accuracy. The results provide a valuable reference for the real-time, rapid, and accurate evaluation of soil salinity of arable land in arid areas.
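
The two accuracy metrics reported above, R² and RMSE, can be illustrated with a minimal sketch. It assumes R² is computed as one minus the ratio of the residual to the total sum of squares; the study may instead use another definition, such as the squared Pearson correlation. The function names and sample values are hypothetical and are not data from the study.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical SSC values in g/kg, for illustration only (not from the study).
observed_ssc = [2.0, 4.0, 6.0, 8.0]
predicted_ssc = [3.0, 3.0, 7.0, 7.0]

print(f"RMSE = {rmse(observed_ssc, predicted_ssc):.3f} g/kg")
print(f"R2   = {r_squared(observed_ssc, predicted_ssc):.3f}")
```

RMSE carries the units of the measured quantity (g·kg⁻¹ for SSC), whereas R² is dimensionless, so the two are usually reported together.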