Abstract

Visible–near infrared (vis–NIR) spectroscopy has been widely used to characterize soil information from field to global scales. Before a calibrated spectral predictive model is applied to acquire soil information, either independent validation or k-fold cross-validation is used to evaluate model performance. However, there is no consensus on which validation strategy is more suitable and robust for evaluating model performance in studies at different scales. The objective of this study was to evaluate and compare the model performance of two validation strategies coupled with different calibration sizes (calibration-to-validation ratios of 2:1, 4:1 and 9:1) and calibration sampling strategies (random sampling (RS), rank, Kennard–Stone (KS), rank–Kennard–Stone (RKS) and conditioned Latin hypercube sampling (cLHS)) across scales. A total of 17,272 vis–NIR spectra of mineral soils from the LUCAS dataset (continental scale), together with their soil organic carbon (SOC) and clay contents, were used in this study, and the dataset was further split into a national dataset (2761 samples in France) and five regional datasets (110 to 248 samples from five French administrative regions). To eliminate the effect of a changing validation set on model performance, a consistent test set (20% of the total samples at each scale) was held out to evaluate all combinations of the two validation strategies. Lin's concordance correlation coefficient (CCC) of the Cubist models was stable for both SOC and clay across calibration sizes, calibration sampling strategies and validation strategies when the calibration size was large (>1400) at the national and continental scales. A larger calibration size can potentially improve model performance for a small dataset (<300) at the regional scale, and a wider calibration range yields better model performance. No silver bullet was found among the different calibration sampling strategies at the regional scale. For the five French regions (small datasets), we found a high variation (95th percentile minus 5th percentile) in the CCC among the models built from 50 repeated runs of RS (0.10–0.44 for SOC, 0.16–0.52 for clay) and cLHS (0.08–0.40 for SOC, 0.12–0.36 for clay). This finding indicates that selecting the calibration set with a one-time RS or cLHS carries high uncertainty in model evaluation for a small dataset and should therefore be used with caution. We thus suggest the following: (1) for a large dataset (thousands of samples), either one-time random sampling for independent validation or k-fold cross-validation is appropriate; (2) for a small dataset (dozens to hundreds of samples), k-fold cross-validation and/or repeated random sampling for independent validation is more robust for evaluating spectral predictive models.
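To make the evaluation protocol concrete, the sketch below reproduces its core loop on synthetic data: a fixed 20% test set is held out once, the calibration set is then re-drawn 50 times by random sampling, and the 5th–95th percentile spread of Lin's CCC on the fixed test set is reported. This is a minimal illustration, not the study's code: the random-forest regressor stands in for the Cubist model, and the data, sizes and variable names are assumptions. Lin's CCC is computed from its standard definition, CCC = 2*cov(y, yhat) / (var(y) + var(yhat) + (mean(y) - mean(yhat))^2).

    # Minimal sketch of repeated random-sampling (RS) model evaluation.
    # Assumptions: synthetic "spectra", RandomForestRegressor as a
    # stand-in for Cubist; names are illustrative, not from the study.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    def lins_ccc(y_true, y_pred):
        # Lin's concordance correlation coefficient (standard definition).
        mu_t, mu_p = y_true.mean(), y_pred.mean()
        cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
        return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))                # 200 hypothetical spectra
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=200)  # e.g. SOC

    # Hold out a consistent test set (20%) once, so every repetition is
    # evaluated against the same samples.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Re-draw the calibration set 50 times by random sampling (here a
    # 4:1 calibration-to-validation ratio; the left-out 20% would be the
    # validation set in the study's design) and record CCC on the fixed
    # test set.
    cccs = []
    for seed in range(50):
        X_cal, _, y_cal, _ = train_test_split(
            X_rest, y_rest, train_size=0.8, random_state=seed)
        model = RandomForestRegressor(n_estimators=200, random_state=seed)
        model.fit(X_cal, y_cal)
        cccs.append(lins_ccc(y_test, model.predict(X_test)))

    spread = np.percentile(cccs, 95) - np.percentile(cccs, 5)
    print(f"CCC 5th-95th percentile spread over 50 repeats: {spread:.3f}")

A wide spread from this loop on a small dataset is exactly the symptom the abstract warns about: a one-time random split can land anywhere in that range, which is why repeated splits or cross-validation are recommended for small datasets.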
