In search of an optimum sampling algorithm for prediction of soil properties from infrared spectra.

Wartini Ng, Budiman Minasny, Patrick Filippi, Brendan Malone

doi:10.7717/peerj.5722

Abstract

Background: The use of visible-near infrared (vis-NIR) spectroscopy for rapid soil characterisation has gained a lot of interest in recent times. Soil spectral absorbance from the visible-infrared range can be calibrated using regression models to predict a set of soil properties. The accuracy of these regression models relies heavily on the calibration set. The optimum sample size and the overall sample representativeness of the dataset could further improve the model performance. However, there is no guideline on which sampling method should be used for datasets of different sizes.

Methods: Here, we show that different sampling algorithms perform differently depending on dataset size and regression model (Cubist regression tree and partial least squares regression (PLSR)). We analysed the effect of three sampling algorithms: Kennard-Stone (KS), conditioned Latin Hypercube Sampling (cLHS) and k-means clustering (KM) against random sampling on the prediction of up to five different soil properties (sand, clay, carbon content, cation exchange capacity and pH) on three datasets. These datasets have different coverages: a European continental dataset (LUCAS, n = 5,639), a regional dataset from Australia (Geeves, n = 379), and a local dataset from New South Wales, Australia (Hillston, n = 384). Calibration sample sizes ranging from 50 to 3,000 were derived and tested for the continental dataset, and from 50 to 200 samples for the regional and local datasets.

Results: Overall, PLSR gives better predictions than the Cubist model for the various soil properties. It is also less sensitive to the choice of sampling algorithm. The KM algorithm is more representative in the larger dataset up to a certain calibration sample size. The KS algorithm appears to be more efficient (as compared to random sampling) in small datasets; however, the prediction performance varied a lot between soil properties. The cLHS sampling algorithm is the most robust sampling method for multiple soil properties regardless of the sample size.

Discussion: Our results suggest that the optimum calibration sample size depends on how much generalization the model has to achieve. Using a sampling algorithm is more beneficial for larger datasets than for smaller ones, where only small improvements can be made. KM is suitable for large datasets, KS is efficient in small datasets but results can be variable, while cLHS is less affected by sample size.
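To make the Kennard-Stone idea concrete, the following is a minimal pure-Python sketch of the classic maximin selection rule. It is an illustration under stated assumptions, not the authors' implementation: the function name `kennard_stone`, the use of Euclidean distance on raw feature tuples, and the toy data are all hypothetical, and the abstract does not say what preprocessing or distance metric was applied to the spectra.

```python
import math


def kennard_stone(points, k):
    """Return indices of k points chosen by the Kennard-Stone maximin rule.

    `points` is a sequence of equal-length numeric tuples (hypothetical
    stand-ins for spectra or their principal component scores).
    """
    n = len(points)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of points")

    def dist(a, b):
        return math.dist(points[a], points[b])

    # Seed the selection with the two mutually most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(*pair),
    )
    selected = [first, second]

    # min_d[i] is the distance from point i to its nearest selected point.
    min_d = [min(dist(i, first), dist(i, second)) for i in range(n)]

    while len(selected) < k:
        # Add the candidate that is farthest from everything selected so far.
        nxt = max(
            (i for i in range(n) if i not in selected),
            key=lambda i: min_d[i],
        )
        selected.append(nxt)
        for i in range(n):
            min_d[i] = min(min_d[i], dist(i, nxt))
    return selected


# Toy example: five points on a line; the extremes are picked first,
# then the point that best fills the gap between them.
toy = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (4.9, 0.0)]
calibration_idx = kennard_stone(toy, 3)
```

Because the rule always reaches for the most extreme remaining sample, it covers the edges of the feature space quickly. That is consistent with its reported efficiency in small datasets, and may also help explain why its results vary between soil properties.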

Highlights

In the last few decades, there has been growing interest in rapid soil characterisation.
The performance of the partial least squares regression (PLSR) and Cubist regression models was evaluated on five soil properties for the continental and regional datasets and four soil properties for the local dataset.
For a comparison between the effects of regression models, only the performance of the random sampling method is discussed.

Summary

Introduction

In the last few decades, there has been growing interest in rapid soil characterisation. Soil spectral absorbance from the visible-infrared range can be calibrated using regression models to predict a set of soil properties. We analysed the effect of three sampling algorithms: Kennard-Stone (KS), conditioned Latin Hypercube Sampling (cLHS) and k-means clustering (KM) against random sampling on the prediction of up to five different soil properties (sand, clay, carbon content, cation exchange capacity and pH) on three datasets. These datasets have different coverages: a European continental dataset (LUCAS, n = 5,639), a regional dataset from Australia (Geeves, n = 379), and a local dataset from New South Wales, Australia (Hillston, n = 384). KM is suitable for large datasets, KS is efficient in small datasets but results can be variable, while cLHS is less affected by sample size.
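As a rough illustration of k-means-based calibration sampling, the sketch below clusters the samples with Lloyd's algorithm and keeps the real sample nearest each cluster centroid. Several parts are assumptions rather than details from the paper: the nearest-to-centroid selection step, the deterministic farthest-point initialisation (chosen only so the example is reproducible), the function name `kmeans_sample`, and the toy data.

```python
import math


def kmeans_sample(points, k, n_iter=50):
    """Cluster `points` into k groups and return the sorted indices of the
    samples nearest each centroid (at most k indices)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Deterministic farthest-point initialisation (an assumption made here
    # for reproducibility; random or k-means++ starts are more usual).
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(tuple(far))

    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated

    # Keep one real sample per cluster: the one closest to the centroid.
    # Using a set means coinciding picks collapse to a single index.
    nearest = {
        min(range(len(points)), key=lambda i: math.dist(points[i], c))
        for c in centroids
    }
    return sorted(nearest)


# Toy example: three well-separated groups of three points each.
toy = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
calibration_idx = kmeans_sample(toy, 3)
```

Unlike the Kennard-Stone rule, this approach favours samples from dense, typical regions of the feature space over extreme ones, which is one plausible reading of why KM is described as suitable for large datasets.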

Journal: PeerJ
Publication Date: Oct 3, 2018
Citations: 40
License type: CC BY 4.0
