The influence of training sample size on the accuracy of deep learning models for the prediction of soil properties with near-infrared spectroscopy data

Wartini Ng, Budiman Minasny, Wanderson De Sousa Mendes, José Alexandre Melo Demattê

doi:10.5194/soil-6-565-2020

Open Access

Journal: SOIL
Publication Date: Nov 17, 2020
Citations: 97
License type: CC BY 4.0

Affiliation: University of Sydney, Universidade de São Paulo

Abstract

The number of samples used in the calibration data set affects the quality of the predictive models generated using visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy for soil attributes. Recently, the convolutional neural network (CNN) has been regarded as a highly accurate model for predicting soil properties on a large database. However, it has not yet been ascertained how large the sample size should be for a CNN model to be effective. This paper investigates the effect of the training sample size on the accuracy of deep learning and machine learning models. It aims to estimate how many calibration samples are needed for CNN to improve on conventional machine learning models in predicting soil properties. In addition, this paper looks at a way to interpret CNN models, which are commonly labelled as a black box. It is hypothesised that the performance of machine learning models will increase with an increasing number of training samples but will plateau once a certain number is reached, while the performance of CNN will keep improving. The performances of two machine learning models (partial least squares regression – PLSR; Cubist) are compared against the CNN model. A VIS–NIR–SWIR spectral library from Brazil, containing 4251 unique sites with an average of two to three samples per site taken at different depths (a total of 12 044 samples), was divided into calibration (3188 sites) and validation (1063 sites) sets. Subsets of the calibration data set were then created to represent smaller calibration data sets of 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, equivalent to sample sizes of approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650. All three models (PLSR, Cubist and CNN) were generated for each subset size for the prediction of five different soil properties, i.e. cation exchange capacity, organic carbon, sand, silt and clay content. The calibration subset sampling and modelling were repeated 10 times to provide a better representation of the model performances. Learning curves showed that the accuracy increased with an increasing number of training samples. At a lower number of samples (< 1000), PLSR and Cubist performed better than CNN. CNN outperformed the PLSR and Cubist models at sample sizes of 1500 and 1800, respectively. It can be recommended that deep learning is most efficient for spectra modelling for sample sizes above 2000. The accuracy of the PLSR and Cubist models seems to reach a plateau above sample sizes of 4200 and 5000, respectively, while the accuracy of CNN has not plateaued. A sensitivity analysis of the CNN model demonstrated its ability to determine important wavelength regions that affected the predictions of various soil attributes.
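The site-based subsetting described in the abstract can be sketched as follows. This is a minimal illustration on toy data, not the authors' code: the function names `split_sites` and `subsample_by_site`, the random shuffling, and the toy site counts are all assumptions. The abstract only states that the split and the subsets are defined by unique sites and that the subsetting is repeated 10 times. Sampling whole sites keeps all depth samples from one site on the same side of the split.

```python
import random
from collections import defaultdict


def split_sites(site_ids, n_validation, seed=0):
    """Split the unique site IDs into calibration and validation sets (assumed random)."""
    rng = random.Random(seed)
    unique_sites = sorted(set(site_ids))
    rng.shuffle(unique_sites)
    validation = set(unique_sites[:n_validation])
    calibration = set(unique_sites[n_validation:])
    return calibration, validation


def subsample_by_site(site_ids, calibration_sites, n_sites, seed=0):
    """Return the sample indices belonging to a random subset of n_sites calibration sites."""
    rng = random.Random(seed)
    chosen = set(rng.sample(sorted(calibration_sites), n_sites))
    return [i for i, site in enumerate(site_ids) if site in chosen]


# Toy data (hypothetical): 40 sites, each with 2 or 3 depth samples.
toy_rng = random.Random(42)
site_ids = [site for site in range(40) for _ in range(toy_rng.choice([2, 3]))]

calibration_sites, validation_sites = split_sites(site_ids, n_validation=10)

# For each subset size, repeat the draw 10 times, as in the abstract.
sizes = defaultdict(list)
for n_sites in (5, 10, 20):
    for repeat in range(10):
        idx = subsample_by_site(site_ids, calibration_sites, n_sites, seed=repeat)
        # A model (e.g. PLSR, Cubist or CNN) would be fitted on samples `idx`
        # here and scored on the samples from `validation_sites`.
        sizes[n_sites].append(len(idx))

for n_sites, counts in sizes.items():
    print(n_sites, "sites ->", sum(counts) / len(counts), "samples on average")
```

Averaging a validation metric over the 10 repeats at each subset size would give one point on the learning curve for each model.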

Highlights

There has been an increasing demand for a rapid and cost-effective method as an alternative to conventional laboratory soil analysis
Absorption near 1400 nm is associated with the first overtone of an O–H stretch vibration of water or a metal–O–H vibration, while absorption near 1900 nm is associated with combination vibrations of water related to the H–O–H bend and O–H stretch (Viscarra Rossel et al., 2009)
We assessed the effect of the training sample size and identified important wavelengths in predicting various soil properties using Cubist and convolutional neural network (CNN) models

Summary

Introduction

There has been an increasing demand for a rapid and cost-effective method as an alternative to conventional laboratory soil analysis. Visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy has been proposed as an alternative tool for soil analysis over the last few decades (Bendor and Banin, 1995; Shepherd and Walsh, 2002; Stenberg et al., 2010). This method is non-destructive and enables the simultaneous prediction of various properties. The performance of the regression models built on the spectra depends on the spectral preprocessing methods (Rinnan et al., 2009) and on the size and representativeness of the calibration samples (Kuang and Mouazen, 2012; Ng et al., 2018). Several studies demonstrated that the performance of the machine learning model did not increase significantly, or even plateaued, as the calibration sample size increased (Figueroa et al., 2012; Ramirez-Lopez et al., 2014; Ng et al., 2018).
