Abstract

Techniques that disaggregate complex soil-terrain polygons from legacy maps are becoming more relevant, as cost-effective, highly detailed soil information is needed to inform agriculture, hydrology, ecology, engineering, and a variety of other disciplines. Disaggregation involves the spatial placement of the individual soil classes within legacy soil polygons that contain multiple soil classes, where the approximate proportion of each soil class is specified and its distribution in the landscape is described verbally or diagrammatically. One of the most common disaggregation approaches is DSMART (Disaggregation and Harmonisation of Soil Map Units through Resampled Classification Trees). However, DSMART is computationally intensive and has many parameters that must be optimised. This study aimed to address these drawbacks, specifically input map selection, feature selection, and resample size optimisation. The research site was located in the upper reaches of the Mvoti river catchment, covering 317 km² in KwaZulu-Natal province, South Africa. The catchment consists of 20 soil-terrain polygons drawn at a 1:250,000 scale from the South African Land Type Survey (LTS). First, the optimal input map derived from landform elements (geomorphons) was selected through a spatially resampled Cramér's V test, which measures the association between the legacy polygons (terrain proportions) and the geomorphon units. This was done for five aggregated geomorphon maps generated with different parameters. Second, three feature selection algorithms (FSAs) were embedded into DSMART to determine whether they could improve accuracy and computational efficiency. Third, the FSAs were compared using 25, 50, 100, and 200 resamples per polygon. The results indicate that the Cramér's V test is a rapid method for determining the optimal input map.
All FSAs achieved significantly greater accuracy than disaggregating the original legacy polygons, and were more computationally efficient than using all 52 covariates. This study has implications for disaggregating both large and small datasets by improving computational efficiency while maintaining acceptable accuracy.
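To illustrate the association measure used for input map selection, the sketch below computes Cramér's V between two categorical arrays (e.g. flattened rasters of legacy polygon IDs and geomorphon classes). This is a minimal, hypothetical example in plain NumPy; the study applies a spatially resampled variant of the test, which is not reproduced here, and the class labels are invented for illustration.

```python
import numpy as np

def cramers_v(x, y):
    """Cramér's V association between two categorical arrays.

    Builds the contingency table, computes the chi-square statistic,
    and normalises it to [0, 1] (0 = no association, 1 = perfect).
    """
    x = np.asarray(x)
    y = np.asarray(y)
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    # Contingency table of observed co-occurrence counts
    obs = np.zeros((xs.size, ys.size))
    np.add.at(obs, (xi, yi), 1)
    n = obs.sum()
    # Expected counts under independence, then chi-square
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(xs.size, ys.size) - 1)))

# Hypothetical flattened rasters: legacy polygon IDs vs. geomorphon classes.
# Each polygon falls entirely on one geomorphon, so association is perfect.
polygons   = ["A", "A", "B", "B", "A", "B", "A", "B"]
geomorphon = ["ridge", "ridge", "valley", "valley",
              "ridge", "valley", "ridge", "valley"]
print(cramers_v(polygons, geomorphon))  # perfect association -> 1.0
```

In the study's setting, the geomorphon map yielding the highest V against the legacy polygons would be selected as the input map before running DSMART.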
