In digital soil mapping (DSM), machine learning (ML) algorithms are widely used to model soil textural classes (STCs). In practice, however, most soil class datasets are imbalanced. This poses a challenge because many ML algorithms implicitly assume roughly balanced classes, so they become biased towards the majority classes and often overlook the minority classes. Within the DSM framework, STCs can be modeled with either a direct or an indirect approach. In the direct approach, the STCs themselves are the target variable of the model. In the indirect approach, the soil texture fractions (clay, silt, and sand) are modeled first, and STCs are then derived from the predicted fractions. Little research has examined the impact of data balancing on STC predictions, and direct and indirect approaches have not been compared in this context. Therefore, this study aimed to evaluate the efficacy of a resampling technique, the synthetic minority oversampling technique (SMOTE), in handling an imbalanced soil texture dataset collected from the Kuhdasht region in western Iran. The study also sought to compare the performance of the direct and indirect modeling approaches. Environmental covariates derived from Landsat 8 and Sentinel-2 images, along with a digital elevation model (DEM), were used as input variables to a random forest (RF) model to predict STCs and soil texture fractions. The results revealed that terrain attributes and Euclidean distances contributed more than remotely sensed data to the models for both the balanced and imbalanced datasets. The kappa indices for the balanced dataset, the imbalanced dataset, and the indirect approach were 89%, 68%, and 38%, respectively. Likewise, the overall accuracies were 91%, 79%, and 68%, respectively.
In the imbalanced dataset, clay loam and loam, which accounted for the majority of observations, showed the highest recall values, followed by sandy clay loam, sandy loam, and silty clay loam. With the indirect approach, the RF model failed to capture the minority classes in terms of validation statistics. Additionally, modeling with the imbalanced dataset resulted in the exclusion of three minority STCs from the final map. Overall, this study showed the importance of balancing STCs prior to modeling to achieve more accurate estimates, as well as the superiority of the direct approach (using balanced datasets) over the indirect approach.
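To make the balancing step concrete, the following is a minimal sketch of SMOTE-style oversampling, not the study's actual implementation. It assumes numeric covariates, Euclidean distance, and oversampling of every class up to the majority-class count; the function name `smote_balance`, the parameter `k`, and the toy covariate values are hypothetical. In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, would normally be used instead.

```python
import random
from collections import Counter, defaultdict
from math import dist


def smote_balance(X, y, k=5, seed=0):
    """Oversample each class to the majority-class count (SMOTE-style sketch).

    A synthetic sample is placed at a random position on the line segment
    between a minority sample and one of its k nearest same-class neighbours.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(list(features))
    target = max(len(rows) for rows in by_class.values())

    # Keep all original samples, then append the synthetic ones.
    X_out, y_out = [list(row) for row in X], list(y)
    for label, rows in by_class.items():
        n_needed = target - len(rows)
        if n_needed == 0 or len(rows) < 2:
            # Majority class, or a single-sample class that cannot be interpolated.
            continue
        for _ in range(n_needed):
            base = rng.choice(rows)
            # k nearest same-class neighbours of the base point, excluding itself.
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: dist(base, r),
            )[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # interpolation weight in [0, 1)
            X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Toy example: two made-up covariates per sample, three textural classes.
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25], [0.30, 0.20],
     [0.90, 0.80], [0.80, 0.90],
     [0.50, 0.10], [0.55, 0.15], [0.60, 0.05]]
y = ["clay loam"] * 4 + ["sandy loam"] * 2 + ["loam"] * 3

X_bal, y_bal = smote_balance(X, y, k=3)
print(Counter(y_bal))
```

Synthetic samples should be generated from the training data only, so that validation samples remain real observations.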
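The gap between overall accuracy and kappa reported above is typical of imbalanced data: a model can score well on accuracy by favouring majority classes while agreeing little beyond chance. The sketch below computes both metrics from their standard definitions, with Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels are made-up illustrative values, not data from the study, and the function names are hypothetical.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the observed class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = overall_accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    if p_e == 1.0:
        return 0.0  # only one class observed and predicted; kappa is undefined
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative labels: 8 majority-class and 2 minority-class observations.
observed = ["loam"] * 8 + ["sandy loam"] * 2
majority_only = ["loam"] * 10  # a model that ignores the minority class

print(overall_accuracy(observed, majority_only))
print(cohen_kappa(observed, majority_only))
```

In this toy case the majority-only predictor reaches 80% accuracy but a kappa of zero, because its agreement with the observations is no better than chance.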