Statistical modelling of measurement error in wet chemistry soil data

Cynthia Van Leeuwen, Gerard Heuvelink, Niels Batjes, Titia Mulder

doi:10.5194/egusphere-egu21-1272

Abstract

There is a growing demand for high-quality soil data to model soil processes and map soil properties. However, wet chemistry measurements of soil properties are subject to many sources of error, such as the observer, the instrument and a lack of standardised methodologies. Consequently, soil data are imperfect and uncertain. Uncertainties in measurements of fundamental soil properties can propagate through, e.g., pedotransfer functions, spectroscopic models and digital soil mapping algorithms. Therefore, it is important to provide detailed uncertainty information about soil measurements to potential data users. In practice, uncertainty estimates are rarely specified by providers of analytical soil data.

In this research, we aimed to quantify uncertainties in synthetic and real-world pH (1:1 soil-water suspension) and Total Organic Carbon (TOC) measurements. We assumed that uncertainty can be represented by a normal distribution. A linear mixed-effects model was applied to estimate the parameters of the normal distribution, i.e., mean and standard deviation, of both synthetic and real-world datasets. The model included ‘sample ID’ as a fixed effect, and ‘batch’ and ‘laboratory’ as random effects. The use of synthetic datasets allowed us to investigate how well the model parameters could be estimated given a specific experimental measurement design, whereas the real-world case served to explore whether the parameter estimates were still accurate for unbalanced datasets.

For balanced datasets (n = 20, 100, 200 and 500) of synthetic pH data for three hypothetical laboratories (two batches per laboratory), the mean estimated standard deviations (σ) of the random effects were σ_batch = 0.10, σ_laboratory = 0.24 and σ_residual = 0.2. These estimates agreed with the σ values used to generate the respective random effects in the synthetic dataset, meaning that the model could accurately estimate the model parameters. Subsequently, the experimental measurement design was changed by randomly removing 20%, 50% and 80% of the data, resulting in unbalanced datasets. In general, the interquartile range (IQR) of σ for each random effect increased as a larger percentage of the data was removed. However, the increase in IQR was larger for n = 20 than for, e.g., n = 200. When comparing 0% and 80% randomly removed data for n = 20, the IQR for the batch effect increased by 60.3%. Conversely, for n = 200 an increase of only 23.5% was observed.

Subsequently, the same model was fitted to real-world pH and TOC data, provided by the Wageningen Evaluating Programs for Analytical Laboratories (WEPAL). The unbalanced dataset structure was first reconstructed and filled with synthetically generated data, based on sample means and standard deviations derived from the measured data. The model was then fitted to both the measured and the synthetic WEPAL datasets. For measured pH, the model yielded σ_batch = 0.27, σ_laboratory = 0.17 and σ_residual = 0.10. The IQRs of the estimated σ from synthetic WEPAL data were 0.04 (batch), 0.06 (laboratory) and 0.02 (residual). The model fitted to the measured TOC data estimated σ_batch = 5.3%, σ_laboratory = 2.8% and σ_residual = 2.1%. For the synthetic WEPAL data, IQRs of 1.3% (batch), 1.4% (laboratory) and 0.4% (residual) were determined for the estimated σ. These findings suggest that realistic model parameter estimates can still be obtained from a highly unbalanced dataset.
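
The balanced synthetic experiment described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation. It assumes that batches are nested within laboratories and that every sample is measured once in every batch, i.e. y_ilb = mu_i + a_l + b_lb + e_ilb, with mu_i the fixed sample mean and a, b and e normally distributed laboratory, batch and residual effects. Instead of fitting a linear mixed-effects model, it uses the classical method-of-moments (nested ANOVA) estimator, which is only valid for balanced designs. The function names, the pH range of the sample means and the generating standard deviations (borrowed from the reported estimates) are all hypothetical choices made for illustration.

```python
import random
from math import sqrt
from statistics import fmean, quantiles


def simulate(n_samples, n_labs=3, n_batches=2,
             sd_lab=0.24, sd_batch=0.10, sd_resid=0.20, rng=None):
    """Return y[i][l][b]: sample i measured once in batch b of laboratory l."""
    if rng is None:
        rng = random.Random(0)
    # Fixed sample means; the 4-8 range is an assumed, illustrative pH range.
    mu = [rng.uniform(4.0, 8.0) for _ in range(n_samples)]
    lab = [rng.gauss(0.0, sd_lab) for _ in range(n_labs)]
    batch = [[rng.gauss(0.0, sd_batch) for _ in range(n_batches)]
             for _ in range(n_labs)]
    return [[[mu[i] + lab[l] + batch[l][b] + rng.gauss(0.0, sd_resid)
              for b in range(n_batches)]
             for l in range(n_labs)]
            for i in range(n_samples)]


def variance_components(y):
    """Method-of-moments standard deviations for a balanced nested design."""
    n, L, B = len(y), len(y[0]), len(y[0][0])
    grand = fmean(v for s in y for lab in s for v in lab)
    m_sample = [fmean(v for lab in s for v in lab) for s in y]
    m_batch = [[fmean(y[i][l][b] for i in range(n)) for b in range(B)]
               for l in range(L)]
    m_lab = [fmean(m_batch[l]) for l in range(L)]

    # Mean squares for laboratory, batch within laboratory, and residual.
    ms_lab = n * B * sum((m - grand) ** 2 for m in m_lab) / (L - 1)
    ms_batch = n * sum((m_batch[l][b] - m_lab[l]) ** 2
                       for l in range(L) for b in range(B)) / (L * (B - 1))
    ss_res = sum((y[i][l][b] - m_sample[i] - m_batch[l][b] + grand) ** 2
                 for i in range(n) for l in range(L) for b in range(B))
    ms_res = ss_res / ((n - 1) * (L * B - 1))

    # Solve the expected-mean-square equations; truncate negatives at zero.
    var_batch = max(0.0, (ms_batch - ms_res) / n)
    var_lab = max(0.0, (ms_lab - ms_batch) / (n * B))
    return {"laboratory": sqrt(var_lab), "batch": sqrt(var_batch),
            "residual": sqrt(ms_res)}


def iqr(values):
    """Interquartile range of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def residual_sd_iqr(n_samples, n_reps=100, seed=1):
    """IQR of the residual sd estimate over repeated synthetic datasets."""
    rng = random.Random(seed)
    estimates = [variance_components(simulate(n_samples, rng=rng))["residual"]
                 for _ in range(n_reps)]
    return iqr(estimates)


if __name__ == "__main__":
    est = variance_components(simulate(200))
    print({name: round(sd, 3) for name, sd in est.items()})
    print("IQR of residual sd, n=20: ", round(residual_sd_iqr(20), 4))
    print("IQR of residual sd, n=100:", round(residual_sd_iqr(100), 4))
```

With only three laboratories and two batches per laboratory, the laboratory and batch estimates rest on very few degrees of freedom and vary strongly between replicates, whereas the residual estimate tightens quickly as n grows. Once data are removed and the design becomes unbalanced, these closed-form estimators no longer apply and a likelihood-based mixed-model fit is needed, as in the study.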

Highlights

A soil system's physical and chemical properties are commonly determined by the collection and subsequent wet chemistry analysis of soil samples
The interquartile range (IQR) of the residual variance estimates dropped by 80% between n = 20 and n = 500
Our expectations were in line with the observed IQRs for the batch effect and residual variance, where the difference in IQR between 0% and 80% removed data was largest for n = 20 (171% increase in batch effect variance IQR)

Summary

Introduction

A soil system's physical and chemical properties are commonly determined by the collection and subsequent wet chemistry analysis of soil samples. The results from wet chemistry measurements can further be used, for instance, to develop soil spectroscopy models (McBratney, Minasny, & Rossel, 2006) or to estimate soil organic carbon stocks (Smith et al., 2020). The Global Soil Laboratory Network (GLOSOLAN) aims to build laboratory capacity and improve the provision of reliable and comparable soil data by harmonizing methods, units, data and information. Factors that often contribute to measurement error are the analyst, complex wet chemistry methodologies, varying measurement conditions (e.g., temperature and humidity), differing sample preparation methods and the measurement instrument itself (Allchin, 2001; Libohova et al., 2019; Viscarra Rossel & McBratney, 1998). Building upon the need for high-quality soil data, we aimed to quantify the uncertainty associated with defined analytical methods.
