Developing accurate precipitation products with high spatio-temporal coverage is crucial for a wide range of applications. In this context, precipitation data merging (PDM), which blends satellite-based estimates with ground-based measurements, holds a prominent position, and machine learning (ML) algorithms are increasingly deployed for it. In light of recent advances in the field, this work discusses key aspects of the PDM problem associated with: a) the conceptual formulation of the problem, which is closely related to the training of ML models and their predictive capacity; b) the selection of the products to be fused, which is associated with the latency of the final product and the operational applicability of the method; and c) the efficiency of single-step and two-step merging approaches, with the former treating the problem with regression algorithms only, and the latter with the combined use of classification and regression algorithms. By formulating PDM as a spatio-temporal prediction problem, we define and assess two training strategies for the ML models, termed the full and the per time step strategy, which entail building a single ML model or several ML models, respectively. Furthermore, the performance of the full training strategy, which allows predictions in both the spatial and temporal dimensions, is assessed in the context of single-step and two-step merging. In each of the three scenarios, three popular tree-based ensemble ML algorithms are employed, namely random forest, gradient boosting and extreme gradient boosting, resulting in nine merged products.
To provide empirical evidence, we employ a datacube composed of ground-based daily precipitation observations, satellite-based and reanalysis estimates, and auxiliary covariates, from 1009 uniformly distributed cells (each representative of a sampling area of 25 × 25 km) over four countries around the world (Australia, USA, India and Italy). The large-scale experiment indicates that: (i) the full training strategy is a competitive alternative to the per time step strategy, since it enables the development of methods with improved accuracy, in terms of both performance metrics and reproduction of statistics, as well as higher predictive capability and operational applicability; (ii) two-step merging enables a much better reproduction of precipitation occurrence characteristics, as reflected in the improvement of the relevant categorical metrics and in the reproduction of probability and autocorrelation coefficient; (iii) no significant difference was observed in the performance of the different ML algorithms.
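As a rough illustration of the single-step versus two-step distinction under the full training strategy, the following minimal sketch pools all (cell, day) samples into one training set, then fits either a single regressor or a classifier followed by a regressor. It is not the authors' implementation: the synthetic data, the covariate names (`satellite`, `reanalysis`, `elevation`), the wet-day threshold, and the helper functions `fit_two_step` and `predict_two_step` are all assumptions made for illustration, and scikit-learn's random forest stands in for any of the three tree-based algorithms.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the datacube: each row is one (cell, day) sample.
n_samples = 600
satellite = rng.gamma(shape=0.5, scale=4.0, size=n_samples)   # mm/day (assumed)
reanalysis = rng.gamma(shape=0.5, scale=4.0, size=n_samples)  # mm/day (assumed)
elevation = rng.uniform(0.0, 2000.0, size=n_samples)          # auxiliary covariate (assumed)
X = np.column_stack([satellite, reanalysis, elevation])

# Synthetic "gauge" target: a noisy blend of the estimates, set to zero on dry days.
gauge = 0.6 * satellite + 0.4 * reanalysis + rng.normal(0.0, 0.3, n_samples)
gauge = np.where(gauge < 1.0, 0.0, gauge)

WET_THRESHOLD = 0.0  # assumed definition of a wet day: gauge > 0 mm

# Full training strategy: all cells and time steps are pooled into one training set.
train, test = slice(0, 450), slice(450, None)


def fit_two_step(X, y):
    """Fit an occurrence classifier, then an amount regressor on wet samples only."""
    wet = y > WET_THRESHOLD
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, wet)
    reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[wet], y[wet])
    return clf, reg


def predict_two_step(clf, reg, X):
    """Predict zero where the classifier says dry, the regressor's amount elsewhere."""
    is_wet = clf.predict(X).astype(bool)
    amounts = np.zeros(len(X))
    if is_wet.any():
        amounts[is_wet] = reg.predict(X[is_wet])
    return amounts


# Single-step merging: regression only.
reg_single = RandomForestRegressor(n_estimators=50, random_state=0)
reg_single.fit(X[train], gauge[train])
pred_single = reg_single.predict(X[test])

# Two-step merging: classification of occurrence, then regression of amount.
clf, reg = fit_two_step(X[train], gauge[train])
pred_two_step = predict_two_step(clf, reg, X[test])
```

In this sketch the two-step prediction is exactly zero wherever the classifier predicts a dry day, whereas the single-step regressor can return small positive amounts on such days. A per time step variant would instead fit a separate model on the samples of each day.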