A fine-resolution digital soil class map is needed. However, imbalanced data lead to an inaccurate spatial distribution of soil classes in digital soil class maps, and the spatial resolution of large-scale digital soil class maps in existing studies is low. To address these points, over-sampling and under-sampling were introduced to handle the imbalanced data and to improve the performance of the soil classification model. A total of 316 topsoil samples covering eight main soil classes at the great group level were collected in Eastern China. Eight of twelve candidate prediction variables were selected after importance evaluation by “Mean Decrease Accuracy” in the random forest (RF) model, including digital elevation model (DEM), enhanced vegetation index (EVI), land surface wetness index (LSWI), land surface temperature (LST), normalized difference vegetation index (NDVI), and soil texture components. The RF model was also applied to produce the digital soil class map, and the results for treated (over-sampling and under-sampling by randomly increasing or decreasing the number of samples) and untreated data were compared and discussed. The results indicated that modeling with imbalanced data produced uncertain soil class maps in which minority classes were lost, with lower accuracy (overall accuracy = 83.83 %, kappa coefficient = 0.79) than modeling with balanced data. After the over-sampling and under-sampling treatments, these problems were largely resolved, with an overall accuracy of 96.72 % and a kappa coefficient of 0.93. The prediction accuracy for minority classes improved by 12.5 %–54.5 %. Compared with the existing conventional soil map, the new map, with a fine resolution of 30 × 30 m, is time-efficient and more detailed. Validation (point validation and map-to-map comparison) showed that the predicted map is reliable and can readily serve as a reference for other soil and environmental studies.
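The random over-sampling and under-sampling described in the abstract, together with the two reported metrics (overall accuracy and kappa coefficient), can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states only that sample counts were randomly increased or decreased, so the single per-class `target` count, the fixed `seed`, and the function names `random_resample` and `overall_accuracy_and_kappa` are assumptions made here for illustration.

```python
import random
from collections import Counter, defaultdict


def random_resample(samples, labels, target, seed=0):
    """Randomly over- or under-sample every class to `target` samples.

    `target` and `seed` are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    keep = []
    for idx in by_class.values():
        if len(idx) < target:
            # Over-sampling: keep all originals and add random duplicates.
            keep += idx + rng.choices(idx, k=target - len(idx))
        else:
            # Under-sampling: keep a random subset, drawn without replacement.
            keep += rng.sample(idx, target)
    return [samples[i] for i in keep], [labels[i] for i in keep]


def overall_accuracy_and_kappa(y_true, y_pred):
    """Return (overall accuracy, Cohen's kappa) for two label sequences."""
    n = len(y_true)
    # Observed agreement: share of samples predicted correctly.
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the class marginals, summed over classes.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    # Undefined when pe == 1 (all labels in a single class).
    return po, (po - pe) / (1 - pe)


# Toy imbalanced data: 10 samples of class "A", 2 of class "B".
samples = list(range(12))
labels = ["A"] * 10 + ["B"] * 2
balanced_samples, balanced_labels = random_resample(samples, labels, target=5)
class_counts = Counter(balanced_labels)  # 5 samples per class after resampling
```

In practice, resampling of this kind is usually applied to the training data only, so that duplicated samples do not appear in both the training and validation sets; the abstract does not state how the study handled this.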