Abstract

Few studies in digital soil mapping (DSM) explicitly evaluate how the performance of machine learning algorithms (MLAs) such as decision trees changes under varying conditions like data splitting strategies and feature selection methods. Because more powerful black-box models such as Random Forest (RF) are available, simpler models like the Classification and Regression Tree (CART) are applied less often, even though they are more intelligible. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while retaining its intelligibility, interpretability and intuitive appeal. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with their respective CART models. For this work, data for 440 topsoil samples (0–20 cm) from the Czech Republic were retrieved from the LUCAS soil database. Across the CART models, three data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling, cLHS) and seven feature selection methods were varied. Model results were compared using accuracy metrics including the root mean square error (RMSE) and the coefficient of determination (R²). One of the satisfactory SOC validation results, based on SPlit, has an RMSE of 17.30 g/kg and an R² of 0.52. cLHS proves robust for data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Overall, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling with a decision tree (CART).
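For readers unfamiliar with the two validation metrics reported above, the following is a minimal sketch of how RMSE and R² are commonly computed. It assumes the 1 − SS_res/SS_tot definition of R²; the abstract does not state which definition the study used. The function names and the sample SOC values are hypothetical and are not taken from the study.

```python
import math

def rmse(observed, predicted):
    """Root mean square error, in the same units as the inputs (e.g. g/kg)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical SOC values in g/kg, for illustration only (not study data).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(observed, predicted):.2f} g/kg")
print(f"R2   = {r_squared(observed, predicted):.3f}")
```

Because RMSE carries the units of the target variable, a value such as 17.30 g/kg is best read against the range of SOC in the validation set, whereas R² is unitless.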
