Spatial Random Forest (S-RF): A random forest approach for spatially interpolating missing land-cover data with multiple classes

Jacinta Holloway-Brown, Kate J Helmstedt, Kerrie L Mengersen

doi:10.1080/01431161.2021.1881183

Open Access

https://doi.org/10.1080/01431161.2021.1881183

Abstract

Land-cover maps are important tools for monitoring large-scale environmental change and can be regularly updated using free satellite imagery. A key challenge in constructing these maps is missing data in the satellite images on which they are based. To address this challenge, we created a Spatial Random Forest (S-RF) model that can accurately interpolate missing data in satellite images based on a modest training set of observed data in the image of interest. We demonstrate that this approach can be effective with only a minimal number of spatial covariates, namely latitude and longitude. We use only latitude and longitude because these covariates are available for all images, whether the data are observed or missing due to cloud cover. The S-RF model can flexibly partition these covariates to provide accurate estimates, and additional covariates can easily be incorporated to improve estimation if available. The effectiveness of our approach has previously been demonstrated for predicting two land-cover classes in an Australian case study. In this paper, we extend the method to more than two classes and demonstrate its performance at interpolating multiple land-cover classes using a case study drawn from South America. The results show that the method performs better when predicting three land-cover classes than when predicting five or ten, and that other information is needed to improve performance as the number of classes grows, particularly if the classes are unbalanced. We explore two issues through a sensitivity analysis: the influence of the amount of missing data in the image, and the influence of the amount of training data, on model development and performance. The results show that the amount of missing data due to cloud cover influences model performance for multiple classes. We also found that increasing the amount of training data beyond 100,000 observations had minimal impact on model accuracy. Hence, a relatively small amount of observed data is required for training the model, which is beneficial for computation time.
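
The core idea described in the abstract (train a random forest on observed pixels using only latitude and longitude, then predict the class of cloud-masked pixels) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of Python with scikit-learn, the synthetic 100 x 100 image, the three-class layout, the rectangular "cloud" patches, the training sample size of 2,000, and all variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical 100 x 100 image; row and column indices stand in for
# latitude and longitude (an assumption for this sketch).
n = 100
rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
lat = rows.ravel().astype(float)
lon = cols.ravel().astype(float)

# Synthetic land cover with three spatially contiguous classes.
labels = np.zeros(n * n, dtype=int)
labels[lon > 40] = 1
labels[(lat > 60) & (lon > 20)] = 2


def block(lat0, lat1, lon0, lon1):
    """Boolean mask for a rectangular patch (bounds inclusive)."""
    return (lat >= lat0) & (lat <= lat1) & (lon >= lon0) & (lon <= lon1)


# Simulated cloud cover: three rectangular patches of missing pixels.
cloud = block(10, 29, 5, 24) | block(20, 39, 60, 79) | block(50, 69, 30, 49)
observed = ~cloud

# The only covariates are latitude and longitude, which are known for
# every pixel, observed or missing.
X = np.column_stack([lat, lon])

# Train on a modest random sample of the observed pixels.
train_idx = rng.choice(np.flatnonzero(observed), size=2000, replace=False)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[train_idx], labels[train_idx])

# Interpolate the classes of the cloud-masked pixels and score against
# the known synthetic truth.
predicted = model.predict(X[cloud])
accuracy = float(np.mean(predicted == labels[cloud]))
print(f"Accuracy on cloud-masked pixels: {accuracy:.3f}")
```

Because the synthetic classes form large contiguous regions, this toy problem is much easier than real imagery; errors should concentrate where a cloud patch straddles a class boundary. Additional covariates, if available, could be appended as extra columns of `X`.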

Full Text