Although soil spectroscopy has been used successfully to predict soil attributes such as soil organic carbon (SOC), recent work has revealed two limitations of this approach: a tendency toward model overfitting and the lack of transparency of machine learning (ML) methods. We therefore aimed to test (i) whether a cross-validation (CV) strategy oriented to soil profiles improves the generalizability of SOC prediction models and (ii) whether the least absolute shrinkage and selection operator (LASSO) regression method yields more interpretable models than the commonly used partial least squares (PLS) method. We used a soil spectral library composed of 127 soil profiles (n = 701) from Northeast Brazil, containing reflectance data from the visible, near, and short-wave infrared (VNIR) and the mid-infrared (MIR) spectral regions. We tuned the ML models to predict SOC via two CV strategies: standard k-fold CV and leave-soil-profile-out (LSPO) CV. We found that LSPO CV can produce models with better generalizability, as they lose less accuracy than those tuned with k-fold CV. We conclude that disregarding the autocorrelation of SOC within the soil profile can produce models that are prone to overfitting. In addition, LASSO used 105 of the 8604 VNIR covariables and 190 of the 13,336 MIR covariables. Moreover, a few of the LASSO covariables correlated with SOC and are associated with both electronic transitions and bond vibrations in organic compounds. Because these spectral bands and their correlation with organic carbon can be identified easily, the LASSO models were more transparent than the PLS models.
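The difference between the two tuning strategies can be illustrated with a minimal sketch. It assumes scikit-learn and a small synthetic stand-in for the spectral library; the data generator, the alpha grid, the fold count, and all function names are hypothetical and are not taken from the study. LSPO CV is approximated here with `LeaveOneGroupOut`, using the profile identifier as the group, which may differ from the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, LeaveOneGroupOut


def make_synthetic_library(n_profiles=20, per_profile=5, n_bands=100, seed=0):
    """Toy stand-in for a spectral library (NOT the study's data)."""
    rng = np.random.default_rng(seed)
    # One profile id per sample; samples from one profile share a spectral signature.
    profile_id = np.repeat(np.arange(n_profiles), per_profile)
    shared = rng.normal(size=(n_profiles, n_bands))[profile_id]
    X = shared + 0.3 * rng.normal(size=shared.shape)
    # Only three bands carry signal, so a sparse model is appropriate.
    coef = np.zeros(n_bands)
    coef[[10, 40, 75]] = [1.0, -0.8, 0.5]
    y = X @ coef + 0.2 * rng.normal(size=profile_id.size)
    return X, y, profile_id


def tune_lasso(X, y, cv, groups=None):
    """Tune the LASSO penalty with the given CV splitter (assumed alpha grid)."""
    search = GridSearchCV(
        Lasso(max_iter=50_000),
        param_grid={"alpha": np.logspace(-2, 0, 8)},
        cv=cv,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X, y, groups=groups)
    return search


X, y, profile_id = make_synthetic_library()

# Standard k-fold: samples from one profile can land in both training and test folds.
kfold_fit = tune_lasso(X, y, KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-soil-profile-out: every sample of the held-out profile is withheld together.
lspo_fit = tune_lasso(X, y, LeaveOneGroupOut(), groups=profile_id)

for name, fit in [("k-fold", kfold_fit), ("LSPO", lspo_fit)]:
    selected = np.flatnonzero(fit.best_estimator_.coef_)
    print(
        f"{name}: alpha={fit.best_params_['alpha']:.3g}, "
        f"CV RMSE={-fit.best_score_:.3f}, "
        f"bands used={selected.size}/{X.shape[1]}"
    )
```

The indices in `selected` are the covariables with non-zero coefficients, which is what makes a LASSO model straightforward to inspect band by band. On real spectra, the comparison of interest would be how much accuracy each tuned model loses on independent soil profiles.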