Accurate estimation of the gross primary production (GPP) of paddy rice fields is essential for understanding cropland carbon cycles, yet it remains challenging because of spatial heterogeneity. In this study, we integrated high-resolution unmanned aerial vehicle (UAV) imagery into a model based on leaf biochemical properties to improve GPP estimation. The key parameter, the maximum carboxylation rate at the top of the canopy (Vcmax,025), was estimated using several methods of representing spatial information: the mean (μref) and standard deviation (σref) of reflectance, gray-level co-occurrence matrix (GLCM)-based features, the local binary pattern histogram (LBPH), and convolutional neural networks (CNNs). The models were evaluated against two years of eddy covariance (EC) and UAV measurements. The results show that incorporating spatial information substantially improves the accuracy of both Vcmax,025 and GPP estimation. The CNN method achieved the best Vcmax,025 estimation, with an R of 0.94, a root-mean-square error (RMSE) of 19.44 μmol m−2 s−1, and a median absolute percentage error (MdAPE) of 11%, and it also produced highly accurate GPP estimates, with an R of 0.92, an RMSE of 6.5 μmol m−2 s−1, and an MdAPE of 23%. The models driven jointly by μref and GLCM texture features, and by μref and LBPH, also gave promising results, whereas σref contributed less to Vcmax,025 estimation. Shapley value analysis revealed that the contribution of the input features varied considerably across models. The CNN model focused on the near-infrared (NIR) and red-edge bands and gave the most weight to subregions with high spatial heterogeneity. The μref-LBPH joint-driven model mainly prioritized reflectance information, while the μref-GLCM joint-driven model emphasized the GLCM texture indices. As the first study to leverage spatial information from high-resolution UAV imagery for GPP estimation, our work underscores the critical role of spatial information and provides new insight into monitoring the carbon cycle.
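For readers who want to reproduce the three reported accuracy measures, the following is a minimal sketch of how R, RMSE, and MdAPE could be computed from paired observed and estimated values. It assumes that R denotes the Pearson correlation coefficient and that MdAPE is taken relative to the observed values; neither convention is stated explicitly above. The function names and the sample numbers are illustrative only and are not data from this study.

```python
import math
from statistics import median


def pearson_r(obs, est):
    """Pearson correlation between observed and estimated values (assumed meaning of R)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_e = sum(est) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(var_o * var_e)


def rmse(obs, est):
    """Root-mean-square error, in the same units as the inputs."""
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))


def mdape(obs, est):
    """Median absolute percentage error (%), relative to the observations (assumed)."""
    return 100.0 * median(abs((e - o) / o) for o, e in zip(obs, est))


# Made-up values purely for illustration (not measurements from this study).
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [11.0, 18.0, 33.0, 40.0]

print(pearson_r(observed, estimated))
print(rmse(observed, estimated))
print(mdape(observed, estimated))
```

Because MdAPE uses the median rather than the mean, a few observations near zero (for example, low-GPP periods) inflate it less than they would a mean percentage error, which may be one reason to report it alongside RMSE.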