Legacy soil datasets are a valuable resource and should be used to the greatest extent possible. However, such datasets may be incomplete, and lack observations for every attribute, as the dataset may be compiled from multiple studies. To use these datasets in soil mapping and modeling work, it is useful to fill the gaps in the dataset with estimated values. Machine learning is an approach that can provide estimates with high accuracy. In this study, the machine learner Random Forest (RF) was used to estimate bulk density values in an existing dataset from the province of British Columbia (BC), Canada, which was used as a case study dataset. As the dataset had missing observations across all attributes, multiple models needed to be generated and tested to determine the accuracy of the estimated values produced. A total of 512 models were tested using RF, which were then ranked based on the concordance correlation coefficient (CCC) of their estimates; the CCC of all models ranged from 0.51 to 0.92. The estimates of 27 of these models were then used to fill the missing observations in the dataset; the accuracy of these models ranged from a CCC of 0.63 to 0.92. Further, uncertainty estimates for the predictions were generated using quantile regression, which was coupled with the RF approach. Each model tested therefore had an accuracy measurement and an uncertainty estimate. This approach, of using multiple models developed in RF, can be applied to other legacy soil datasets with inconsistently missing values to produce estimates which can fill the missing observations, and produce uncertainty estimates for those estimates.
Read full abstract