Divergence metrics for determining optimal training sample size in digital soil mapping

Daniel D Saurette, Richard J Heck, Adam W Gillespie, Aaron A Berg, Asim Biswas

doi:10.1016/j.geoderma.2023.116553

Abstract

Digital soil mapping (DSM) typically requires three common ingredients: georeferenced samples, environmental covariates, and a model. Of the three, sample design, or the selection of sample size and locations, has received considerably less attention. This is not surprising given that most studies are primarily limited by budget, the result being a focus on stratification of sampling locations in covariate (feature) space with less emphasis placed on the sample size. At the very least, determining the optimal sample size, regardless of whether it is achievable within a given budget, provides critical information about what is lost by not collecting enough samples for a given study area. In this study, we evaluated the use of the Kullback-Leibler divergence (DKL), the Jensen-Shannon divergence (DJS), the Jensen-Shannon distance (DistJS), and the normalized variance in determining an optimal sample size for predicting total soil carbon at the field scale. The divergence metrics were computed for replicated (n = 10) sample plans using the conditioned Latin hypercube sampling algorithm across increasing sample sizes of 10, 25, and 50 to 400 in steps of 50 to determine an optimal sample size; the sensitivity of the divergence metrics to the number of covariates and to the number of bins used in their computation was also evaluated. The random forest algorithm was used to train predictive models using the same replicated sample sizes to determine the sample size required to optimize model performance based on root mean square error and Lin’s concordance correlation coefficient. The divergence metrics were insensitive to the number of covariates, but very sensitive to the number of bins specified for their calculation. On average, optimal sample size increased linearly (two additional samples per additional bin) regardless of the number of covariates used.
The optimal sample sizes were 124, 133 and 220 for the DKL, DJS and DistJS divergence metrics, respectively, while the variance technique proved to be unreliable. Based on the model performance metrics from model validation, the optimal sample size ranged from 146 to 154 samples. The DistJS overestimated the optimal sample size considerably, while the DKL and DJS were quite similar to the optimal sample size determined from model validation. Future work should evaluate the use of divergence metrics for determining optimal sample size for multiple soil properties or classes, using various machine learning models, across different project scales, and with other sampling algorithms.
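
To make the three divergence metrics concrete, the following is a minimal Python sketch of how DKL, DJS and DistJS could be computed between the binned distribution of a covariate in the full study area and in a sample, and then tracked across the sample sizes listed above. It is an illustration under stated assumptions, not the procedure used in the study: the covariates are synthetic, simple random sampling stands in for conditioned Latin hypercube sampling, the divergence is taken from the population to the sample, base-2 logarithms are used, empty bins are replaced by a small constant, and the function names (`binned_probs`, `divergence_by_sample_size`, and so on) are hypothetical.

```python
import numpy as np

def binned_probs(values, edges):
    """Relative frequency of values in each bin defined by edges."""
    counts, _ = np.histogram(values, bins=edges)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) in bits.
    Assumption: zero probabilities are replaced by eps to keep the log finite."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 with base-2 logs."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    return float(np.sqrt(max(js_divergence(p, q), 0.0)))

def divergence_by_sample_size(population, sizes, n_bins=25, n_reps=10, seed=0):
    """Mean DKL (averaged over covariates and replicates) for each sample size.
    Assumption: simple random sampling stands in for cLHS."""
    rng = np.random.default_rng(seed)
    n_rows, n_cov = population.shape
    # Bin edges and reference distributions come from the full population.
    edges = [np.linspace(population[:, j].min(), population[:, j].max(), n_bins + 1)
             for j in range(n_cov)]
    pop_probs = [binned_probs(population[:, j], edges[j]) for j in range(n_cov)]
    result = {}
    for n in sizes:
        reps = []
        for _ in range(n_reps):
            idx = rng.choice(n_rows, size=n, replace=False)
            per_cov = [kl_divergence(pop_probs[j],
                                     binned_probs(population[idx, j], edges[j]))
                       for j in range(n_cov)]
            reps.append(np.mean(per_cov))
        result[n] = float(np.mean(reps))
    return result

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    covariates = rng.normal(size=(5000, 3))         # synthetic stand-in covariates
    sizes = [10, 25] + list(range(50, 401, 50))     # 10, 25, then 50 to 400 by 50
    for n, d in divergence_by_sample_size(covariates, sizes).items():
        print(f"n = {n:3d}  mean DKL = {d:.4f}")
```

In this sketch the divergence falls as the sample size grows, because larger samples leave fewer bins empty. The number of bins (`n_bins`) directly controls how many cells must be populated, which is consistent with the bin sensitivity reported above. Swapping `kl_divergence` for `js_divergence` or `js_distance` inside `divergence_by_sample_size` gives the corresponding curves for the other two metrics.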

Full Text