Abstract

The high dimensionality of spectral datasets makes it difficult to select the optimal subset of variables. This paper presents a new variable selection method, significant multivariate competitive population analysis (SMCPA), which combines ideas from significant multivariate correlation (SMC) and model population analysis and employs weighted bootstrap sampling (WBS) and exponential decline function (EDF) competition methods. In this study, the values of the SMC distributions are used as an index of the importance of each wavelength. Based on that importance level, SMCPA sequentially selects N subsets of spectral wavelengths through N Monte Carlo sampling runs in an iterative and competitive procedure. In each sampling run, a fixed ratio of samples is used to build a calibrated partial least-squares (PLS) model, and SMC is then performed to obtain the score and threshold values. Next, based on the SMC scores, the key variables are selected in two steps: compulsory selection by the exponential decline function and competitive selection by adaptive weighted sampling. Finally, cross-validation (CV) is applied to select the subset with the lowest root mean square error. The method is tested on three NIR spectral datasets and compared against three high-performance variable selection methods. In these experiments, the proposed algorithm was the most efficient of the methods compared and gave the best selection results, and it could usually locate the optimal combination of key wavelength variables in a dataset. PLS models built on the selected variables also gave the best evaluation results.
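The two-step selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the EDF formula, so the sketch assumes the CARS-style form r_i = a·exp(-k·i), chosen so that all variables are kept in the first run and two in the last. The function names `edf_ratio` and `select_variables` are hypothetical, and the scores are assumed to be non-negative importance values such as SMC scores.

```python
import math
import random


def edf_ratio(i, n_runs, n_vars):
    """Fraction of variables retained at run i (1-based).

    Assumed CARS-style exponential decline: ratio is 1 at i = 1
    and 2 / n_vars at i = n_runs.
    """
    a = (n_vars / 2) ** (1 / (n_runs - 1))
    k = math.log(n_vars / 2) / (n_runs - 1)
    return a * math.exp(-k * i)


def select_variables(scores, ratio, rng):
    """Two-step selection on non-negative importance scores.

    Step 1 (compulsory): keep the top `ratio` share of variables by score.
    Step 2 (competitive): draw with replacement from the survivors, with
    probability proportional to score, and keep the distinct indices drawn.
    """
    n_keep = max(2, round(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = ranked[:n_keep]
    weights = [scores[j] for j in kept]
    drawn = rng.choices(kept, weights=weights, k=n_keep)
    return sorted(set(drawn))
```

A possible call would be `select_variables(scores, edf_ratio(i, n_runs, len(scores)), random.Random(0))` inside the loop over sampling runs; variables with low scores are removed outright in step 1, while step 2 lets the remaining ones compete for survival.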

Highlights

  • Because it is simple, rapid, noninvasive, cost-effective and requires no sample pretreatment, near-infrared spectroscopy has been widely adopted as an analytical tool for both qualitative and quantitative analysis in the petrochemical, pharmaceutical, environmental, clinical, agricultural, food and biomedical fields [1]

  • The calibration set was used for variable selection and goodness of fit, and the independent test set was used to validate the predictions of the calibration model. The partial least-squares (PLS) model in our paper is based on NIPALS-PLS

  • To evaluate the performance of significant multivariate competitive population analysis (SMCPA), we compared it with three established methods: competitive adaptive reweighted sampling (CARS), VCPA and the bootstrapping soft shrinkage method (BOSS)

Summary

INTRODUCTION

Because it is simple, rapid, noninvasive, cost-effective and requires no sample pretreatment, near-infrared spectroscopy has been widely adopted as an analytical tool for both qualitative and quantitative analysis in the petrochemical, pharmaceutical, environmental, clinical, agricultural, food and biomedical fields [1]. In chemometrics, it is therefore important to study sensitive, fast and accurate methods of extracting the variables relevant to predicting the chemical composition of samples, and to develop dimension reduction methods that address these problems [5]. Projection methods such as partial least-squares (PLS) [6] and principal component regression (PCR) [7] have been widely used to handle such high-dimensional, collinear data. SMCPA optimizes more than one objective function simultaneously, so that a PLS model built on the optimal subset of variables it selects has better prediction accuracy and is simpler and more interpretable.

MONTE CARLO SAMPLING OF DATA SUBSETS

Monte Carlo sampling (MCS) is an important statistical tool for analyzing complex (multivariate) problems [45]. It is a stochastic method based on the use of random numbers and probability statistics. The absolute value of the regression coefficient is commonly used as a guide to variable importance, but it does not always reflect the true importance of a variable and is affected by factors such as noise.
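The Monte Carlo sampling of data subsets can be illustrated with a short sketch. It assumes that each run draws a fixed ratio of calibration samples without replacement; the 0.8 ratio, the seed and the function name `monte_carlo_subsets` are illustrative assumptions, not values taken from the paper.

```python
import random


def monte_carlo_subsets(n_samples, n_runs, ratio=0.8, seed=0):
    """Return n_runs random subsets of sample indices.

    Each subset holds a fixed share (`ratio`) of the n_samples indices,
    drawn without replacement, so a separate model can be fitted per run.
    """
    rng = random.Random(seed)
    size = round(ratio * n_samples)
    return [sorted(rng.sample(range(n_samples), size)) for _ in range(n_runs)]
```

Each returned index list would then be used to fit one sub-model, and the statistics collected across runs (here, the SMC scores) form the distributions from which variable importance is judged.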

THE SIGNIFICANT MULTIVARIATE CORRELATION
EXPONENTIALLY DECREASING FUNCTION
THE WEIGHTED BOOTSTRAP SAMPLING
RESULTS AND DISCUSSION
CONCLUSION