Sampling and classifier modification to DSMART for disaggregating soil polygon maps

Tahmid Huq Easher, Daniel Saurette, Emma Chappell, Fernando De Jesus Montano Lopez, Marc-Olivier Gasser, Adam Gillespie, Richard J Heck, Brandon Heung, Asim Biswas

doi:10.1016/j.geoderma.2023.116360

Abstract

DSMART has been widely used to disaggregate multi-component soil polygon map units into raster maps. The algorithm draws an equal number of synthetic sample points from every map unit by simple random sampling (SRS) and assigns soil classes in proportion to the map unit composition using the C5.0 algorithm. Conditional Latin hypercube sampling (cLHS) is a maximally stratified random sampling method that guarantees full coverage of the multivariate distribution. cLHS has shown success in digital soil mapping because it optimizes sample point selection based on environmental covariates. Information-driven stratification and sampling may therefore help improve the assignment of soil classes within map units. The objective of this study was to compare cLHS against SRS using three classifier algorithms in DSMART to disaggregate the soil great group and soil series maps of a sub-watershed in Southern Ontario, Canada, as a case study. SRS and cLHS were each applied in two ways, sampling “by polygon” and “by area”, and the three classifiers selected were C5.0, random forest (RF), and k-nearest neighbors (K-NN), giving a total of twelve combinations (four sampling approaches by three classifier algorithms). The original R code of the DSMART package was modified to incorporate cLHS and to change the classifier algorithm. For the soil great group predictions, RF performed best in combination with the “by polygon” SRS and “by polygon” cLHS approaches (Kappa scores of 0.57). For the soil series predictions, C5.0 with the “by area” cLHS approach (Kappa score of 0.50) outperformed the other combinations (Kappa scores of 0.25–0.44). Overall, cLHS performed marginally better with C5.0 and RF, while SRS performed better with K-NN. Disaggregated maps produced using cLHS had lower prediction uncertainty values.
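The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the DSMART package code (which is written in R), and the function names `samples_per_polygon` and `assign_classes` are hypothetical. It also assumes that “by area” means sample counts proportional to polygon area, which the abstract does not define; “by polygon” is taken to mean an equal count per polygon.

```python
import random


def samples_per_polygon(areas, total, by="polygon"):
    """Allocate synthetic sample counts to polygons.

    by="polygon": equal count per polygon.
    by="area":    count proportional to polygon area (an assumption here),
                  with at least one sample per polygon.
    """
    if by == "polygon":
        return {pid: total // len(areas) for pid in areas}
    area_sum = sum(areas.values())
    return {pid: max(1, round(total * a / area_sum)) for pid, a in areas.items()}


def assign_classes(composition, n_samples, rng):
    """Draw soil classes with probability proportional to the map unit composition."""
    classes = list(composition)
    weights = [composition[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n_samples)


# Illustrative inputs only: two polygons with made-up areas and compositions.
areas = {"A": 10.0, "B": 30.0}
compositions = {
    "A": {"series_1": 70, "series_2": 30},
    "B": {"series_2": 50, "series_3": 50},
}

rng = random.Random(42)
counts = samples_per_polygon(areas, total=40, by="area")
labels = {pid: assign_classes(compositions[pid], n, rng) for pid, n in counts.items()}
```

The labelled points, together with covariate values extracted at their locations, would then be passed to the chosen classifier (C5.0, RF, or K-NN). A cLHS variant would replace the uniform choice of point locations with a covariate-driven selection, which this sketch does not attempt.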

Full Text