Abstract. Remotely sensed Earth observations have many missing values. The abundance and often complex patterns of these missing values can be a barrier to combining different observational datasets and may cause biased estimates of derived statistics. To overcome this, missing values in geoscientific data are regularly infilled with estimates through univariate gap-filling techniques such as spatial or temporal interpolation or by upscaling approaches in which complete donor variables are used to infer missing values. However, these approaches typically do not account for information that may be present in other observed variables that also have missing values. Here we propose CLIMFILL (CLIMate data gap-FILL), a multivariate gap-filling procedure that combines kriging interpolation with a statistical gap-filling method designed to account for the dependence across multiple gappy variables. In a first stage, an initial gap fill is constructed for each variable separately using state-of-the-art spatial interpolation. Subsequently, the initial gap fill for each variable is updated to recover the dependence across variables using an iterative procedure. Estimates for missing values are thus informed by knowledge of neighbouring observations, temporal processes, and dependent observations of other relevant variables. CLIMFILL is tested using gap-free ERA-5 reanalysis data of ground temperature, surface-layer soil moisture, precipitation, and terrestrial water storage to represent central interactions between soil moisture and climate. These variables were matched with corresponding remote sensing observations and masked where the observations have missing values. In this “perfect dataset approach”, CLIMFILL can be evaluated against the original, usually unobserved part of the data.
We show that CLIMFILL successfully recovers the dependence structure among the variables across all land cover types and altitudes, thereby enabling subsequent mechanistic interpretations in the gap-filled dataset. Correlation between original ERA-5 data and gap-filled ERA-5 data is high in many regions, although artefacts of the interpolation procedure appear in large gaps in high-latitude regions during winter. Bias and noise in gappy satellite-observable data are reduced in most regions. A case study of the European 2003 heatwave shows how CLIMFILL reduces biases in ground temperature and surface-layer soil moisture induced by the missing values. Furthermore, in idealized experiments we examine the impact of the fraction of missing values and the complexity of missing-value patterns on the performance of CLIMFILL, showing that for most variables CLIMFILL operates at the upper limit of what is possible given the high fraction of missing values and the complexity of missingness patterns. Thus, the framework can be a tool for gap filling a large range of remote sensing observations commonly used in climate and environmental research.
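The two-stage idea and the "perfect dataset approach" described above can be illustrated with a minimal sketch. This is not the CLIMFILL implementation. As labelled assumptions, a per-variable mean fill stands in for the kriging-based spatial interpolation, an iterated linear regression on the other variables stands in for the statistical multivariate method, and synthetic correlated Gaussian data with randomly placed gaps stand in for the masked ERA-5 fields. All function names (`initial_fill`, `iterative_update`, `make_demo`, `rmse`) are hypothetical.

```python
import numpy as np

def initial_fill(data, missing):
    """Stage 1 stand-in: fill each variable's gaps with its observed mean.
    (Assumption: the mean replaces the spatial interpolation of the paper.)"""
    filled = data.copy()
    for j in range(data.shape[1]):
        filled[missing[:, j], j] = data[~missing[:, j], j].mean()
    return filled

def iterative_update(filled, missing, n_iter=10):
    """Stage 2 stand-in: regress each variable on the others and overwrite
    only the originally missing entries with the regression prediction."""
    filled = filled.copy()
    n, p = filled.shape
    for _ in range(n_iter):
        for j in range(p):
            miss = missing[:, j]
            if not miss.any():
                continue
            others = np.delete(filled, j, axis=1)
            X = np.column_stack([np.ones(n), others])
            # Fit on rows where variable j was actually observed.
            coef, *_ = np.linalg.lstsq(X[~miss], filled[~miss, j], rcond=None)
            filled[miss, j] = X[miss] @ coef
    return filled

def make_demo(seed=0, n=2000, frac_missing=0.3):
    """Synthetic 'perfect dataset': complete truth plus a random gap mask."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.8, 0.6],
                    [0.8, 1.0, 0.7],
                    [0.6, 0.7, 1.0]])
    truth = rng.multivariate_normal(np.zeros(3), cov, size=n)
    missing = rng.random(truth.shape) < frac_missing
    gappy = np.where(missing, np.nan, truth)
    return truth, gappy, missing

def rmse(a, b, mask):
    """Root-mean-square error evaluated only on the masked entries."""
    return float(np.sqrt(np.mean((a[mask] - b[mask]) ** 2)))

truth, gappy, missing = make_demo()
stage1 = initial_fill(gappy, missing)
stage2 = iterative_update(stage1, missing)
print("RMSE after stage 1:", rmse(truth, stage1, missing))
print("RMSE after stage 2:", rmse(truth, stage2, missing))
```

Because the truth is known everywhere in this toy setting, the error can be computed on exactly those entries that were masked, which mirrors the evaluation logic of the perfect dataset approach. The random gap mask used here is far simpler than the structured missingness of real satellite products.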