The spatial resolution of the FY-3D MERSI (250–1000 m) is too coarse for agricultural monitoring at the farmland scale (20–30 m). To estimate winter wheat yield (WWY) at the farmland scale from FY-3D data, this work develops a methodological framework. The enhanced deep convolutional spatiotemporal fusion network (EDCSTFN) was used to fuse 10-day-interval FY-3D and Sentinel-2 vegetation indices (VIs), and its results were compared with those of the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM). A back-propagation (BP) neural network was then built to estimate farmland-scale WWY from the fused VIs, with the Aqua MODIS gross primary productivity (GPP) product used as ancillary data. The results show that both the EDCSTFN and ESTARFM fuse the Sentinel-2 and FY-3D VIs with satisfactory precision; however, when the period spanned by the spatiotemporal fusion is relatively long, the EDCSTFN achieves greater precision than ESTARFM. Finally, the WWY estimates based on the fused VIs show remarkable correlations with county-scale WWY data and provide abundant detail on the spatial distribution of WWY, indicating great potential for accurate farmland-scale WWY estimation from FY-3D data reconstructed at fine spatial and temporal resolution.