Spatially clustered sampling may result in non-independent data that pose challenges for environmental mapping applications. Two outstanding challenges resulting from the use of spatially clustered data for predictive geospatial modelling with machine learning approaches are biased model training and biased validation. These issues can be severe for popular bagging models such as Random Forest, yet one or both are often ignored or handled using sub-optimal approaches. We propose to address these challenges using information on both the spatial autocorrelation of map errors and the spatial sampling intensity. This is achieved by applying the residual spatial covariance as a weighting function for the bagging procedure and for the calculation of weighted validation statistics. Using this approach, the full feature space of the sample data is retained during model training and validation. The utility of covariance weighting for these purposes is investigated through extensive simulation with a range of sample clustering configurations. Results are benchmarked against existing approaches. Covariance weighting improved model performance across a range of clustering scenarios, with the greatest improvements appearing for highly clustered data. Covariance-weighted validation demonstrated low bias across a broad range of clustering scenarios compared to existing spatial methods. The findings also suggest, however, that conditional Gaussian simulation approaches may perform well when the proportion of clustered data is very high. Covariance weighting is straightforward to implement, computationally efficient, and scales to different sample sizes and spatial extents.
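To make the idea of covariance weighting concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the weighting formula, so the sketch assumes an exponential covariance model and takes each sample's weight as the inverse of its summed covariance with all samples. The function names (`covariance_weights`, `weighted_bootstrap`, `weighted_rmse`) and the parameters `sill` and `range_param` are hypothetical. The same weights are then used both to draw a bagging sample and to compute a weighted validation statistic.

```python
import numpy as np

def covariance_weights(coords, sill=1.0, range_param=10.0):
    """Hypothetical weights from an assumed exponential covariance model."""
    # Pairwise Euclidean distances between sample locations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumed covariance form: C(h) = sill * exp(-h / range_param).
    cov = sill * np.exp(-dist / range_param)
    # Samples with many strongly correlated neighbours get a small weight.
    raw = 1.0 / cov.sum(axis=1)
    return raw / raw.sum()

def weighted_bootstrap(n, weights, rng):
    """Draw one bagging sample of size n, with probability proportional to weight."""
    return rng.choice(n, size=n, replace=True, p=weights)

def weighted_rmse(y_true, y_pred, weights):
    """Weighted root mean squared error as an example validation statistic."""
    err2 = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return float(np.sqrt(np.average(err2, weights=weights)))

rng = np.random.default_rng(0)
# Synthetic coordinates: a tight cluster of 20 points plus 5 scattered points.
cluster = rng.normal(loc=50.0, scale=1.0, size=(20, 2))
scattered = rng.uniform(0.0, 100.0, size=(5, 2))
coords = np.vstack([cluster, scattered])

w = covariance_weights(coords)
idx = weighted_bootstrap(len(coords), w, rng)  # row indices for one bagged tree
```

In this sketch every sample stays eligible for selection, so no data are discarded; clustered samples are only drawn less often and count for less in the validation statistic. In practice the covariance parameters would be estimated from the model residuals rather than fixed by hand.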