Abstract

Variable selection is a universal problem in building multivariate calibration models, such as quantitative structure-activity relationship (QSAR) models and quantitative relationships between a quantity or property and spectral data. The prediction ability of such models can be improved significantly by reducing the bias induced by uninformative variables. A new criterion, named C, is proposed in this study to evaluate the importance of the variables in a model. The value of C is defined as the average contribution of a variable to the model and is calculated from the statistics of models built with different combinations of the variables. In the calculation, a large number of partial least squares (PLS) models are built, each using a subset of variables selected by random resampling. This yields a vector of prediction errors, expressed as the root mean squared error of cross validation (RMSECV), and a matrix of 1s and 0s indicating which variables were selected and unselected in each model. When multiple linear regression (MLR) is used to model the relationship between the RMSECVs and this matrix, the coefficients of the MLR model serve as a criterion for evaluating the contribution of each variable to the RMSECV. To enhance the efficiency of the method, a multi-step shrinkage strategy is used. The method is compared with Monte Carlo uninformative variable elimination (MC-UVE), the randomization test (RT) and competitive adaptive reweighted sampling (CARS) on three NIR benchmark datasets. The results show that the proposed criterion is effective for selecting informative variables from the spectra and thereby improving the prediction ability of the models.
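The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`pls1_fit`, `rmsecv`, `c_criterion`), the number of models, the inclusion probability, the number of PLS components and the fold layout are all hypothetical choices, the PLS step is a minimal NIPALS PLS1 written with NumPy, the data are synthetic rather than NIR spectra, and the multi-step shrinkage strategy is not included. The sketch also assumes that a more negative MLR coefficient (including the variable lowers the RMSECV) marks a more informative variable; the abstract does not state the sign convention.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Minimal NIPALS PLS1; returns (beta, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(min(n_comp, X.shape[1])):
        w = Xc.T @ yc                      # weight vector
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = Xc @ w                         # scores
        tt = t @ t
        p = Xc.T @ t / tt                  # X loadings
        c = yc @ t / tt                    # y loading
        Xc = Xc - np.outer(t, p)           # deflate X
        yc = yc - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)  # regression vector
    return beta, x_mean, y_mean

def rmsecv(X, y, n_comp=3, n_folds=5):
    """Root mean squared error of k-fold cross validation."""
    folds = np.arange(len(y)) % n_folds    # interleaved folds (assumption)
    sq_err = 0.0
    for k in range(n_folds):
        test = folds == k
        beta, xm, ym = pls1_fit(X[~test], y[~test], n_comp)
        pred = (X[test] - xm) @ beta + ym
        sq_err += np.sum((y[test] - pred) ** 2)
    return np.sqrt(sq_err / len(y))

def c_criterion(X, y, n_models=300, keep_prob=0.5, seed=0):
    """Regress RMSECV on the 0/1 selection matrix; return one coefficient per variable."""
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]
    S = np.zeros((n_models, n_vars))       # 1 = selected, 0 = unselected
    errors = np.zeros(n_models)
    for i in range(n_models):
        mask = rng.random(n_vars) < keep_prob
        if not mask.any():                 # keep at least one variable
            mask[rng.integers(n_vars)] = True
        S[i] = mask
        errors[i] = rmsecv(X[:, mask], y)
    A = np.column_stack([np.ones(n_models), S])   # intercept + indicators
    coef, *_ = np.linalg.lstsq(A, errors, rcond=None)
    return coef[1:]                        # drop the intercept

# Synthetic data: only the first three of 20 variables carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=60)

c_values = c_criterion(X, y)
ranked = np.argsort(c_values)              # most negative coefficient first
print("variables ranked by contribution:", ranked[:5])
```

Under this sign convention, variables whose inclusion reduces the cross-validation error sort to the front of `ranked`, while uninformative variables receive coefficients near zero or slightly positive.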
