Improving quantitative precipitation estimation (QPE) in sparsely gauged basins by merging remote sensing precipitation data with rain gauge data remains a challenge, because most existing merging models degrade when rain gauge data are limited. To address this challenge, we propose an attention-based deep learning model, the Multi-Level Transformer Fusion (MLTF) model, which captures the interactions among multi-source input data (TRMM 3B42 V7 data, GridSat-B1 data, and a digital elevation model, DEM) in a sparsely gauged basin. Taking the source region of the Yellow River basin (SRYRB) as a representative case study, we evaluate the proposed model and compare it with conventional methods (e.g., Multiplicative Bias Correction, Additive Bias Removal, linear regression, and Kriging) and deep-learning-based models (CNN and CNN-LSTM). Results indicate that, relative to the original TRMM data, the merged precipitation produced by the MLTF model in the SRYRB reduces the root mean square error (RMSE) by 27.1 %, reduces the mean absolute error (MAE) by 11.2 %, and increases the correlation coefficient (CC) by 19.2 %, outperforming all the selected comparative methods. Finally, an improved daily precipitation dataset for 1999–2019 with a spatial resolution of 0.05° is produced for the study area. This study proposes a new method for improving QPE in sparsely gauged basins, which can provide valuable data support for regional hydrological studies and water resources management.
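The reported improvements are relative changes in each metric against the original TRMM data. The following is a minimal sketch of how such figures could be computed, assuming the standard definitions of RMSE, MAE, and CC (taken here as the Pearson correlation coefficient) with gauge observations as the reference. The function names and the toy values are hypothetical and are not taken from the study.

```python
import numpy as np

def rmse(est, obs):
    """Root mean square error between estimates and observations."""
    est, obs = np.asarray(est, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean((est - obs) ** 2)))

def mae(est, obs):
    """Mean absolute error between estimates and observations."""
    est, obs = np.asarray(est, float), np.asarray(obs, float)
    return float(np.mean(np.abs(est - obs)))

def cc(est, obs):
    """Pearson correlation coefficient (assumed definition of CC)."""
    est, obs = np.asarray(est, float), np.asarray(obs, float)
    return float(np.corrcoef(est, obs)[0, 1])

def relative_change(new, old):
    """Percentage change of a metric relative to its baseline value."""
    return 100.0 * (new - old) / old

# Toy daily precipitation values in mm (illustrative only, not study data).
gauge = np.array([0.0, 2.0, 5.0, 1.0, 8.0])       # reference observations
satellite = np.array([1.0, 4.0, 3.0, 0.0, 12.0])  # stand-in for the original product
merged = np.array([0.5, 2.5, 4.0, 1.0, 9.0])      # stand-in for a merged estimate

for name, metric in (("RMSE", rmse), ("MAE", mae), ("CC", cc)):
    base = metric(satellite, gauge)
    new = metric(merged, gauge)
    print(f"{name}: {base:.3f} -> {new:.3f} ({relative_change(new, base):+.1f} %)")
```

A negative change in RMSE or MAE and a positive change in CC both indicate an improvement over the baseline product.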