Abstract

The present paper focuses on determining the number of PLS components by using resampling methods such as cross-validation (CV), Monte Carlo cross-validation (MCCV), and bootstrapping (BS). To resample the training data, random non-negative weights are assigned to the original training samples and a sample-weighted PLS model is developed without greatly increasing the computational burden. Random sample weighting (RSW) is a generalization of the traditional resampling methods and is expected to carry a lower risk of producing an insufficient training set. For prediction, only the training samples whose random weights fall below a threshold value are selected, which ensures that the prediction samples have little influence on training. For complicated data, the optimal number of PLS components is often not unique or readily distinguished, and there may instead exist an optimal region of model complexity; in such cases the distribution of prediction errors can be more informative than a single value of the root mean squared error of prediction (RMSEP). Therefore, the distribution of prediction errors is estimated by repeated random sample weighting and used to determine model complexity. RSW is compared with its traditional counterparts CV, MCCV, and BS, as well as with a recently proposed randomization test method, to demonstrate its usefulness. Copyright © 2010 John Wiley & Sons, Ltd.
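
As a concrete illustration of the procedure summarized above, the following is a minimal NumPy sketch of repeated random sample weighting for a single-response PLS (PLS1) model. The weighted NIPALS-style routine, the uniform weight distribution, the 0.25 threshold, and all function names are assumptions made for illustration only; they are not the authors' exact implementation.

```python
# Minimal sketch of random sample weighting (RSW) for choosing the number of
# PLS components. All names, weight distributions and thresholds below are
# illustrative assumptions, not the paper's exact algorithm.
import numpy as np


def weighted_pls1(X, y, w, n_components):
    """Fit a PLS1 model in which training sample i carries weight w[i].

    Uses weighted inner products in a NIPALS-style loop; returns the weighted
    means and a regression vector for every model size 1..n_components.
    """
    w = w / w.sum()
    x_mean = w @ X                       # weighted column means of X
    y_mean = w @ y                       # weighted mean of y
    Xc, yc = X - x_mean, y - y_mean
    W_, P_, Q_, coefs = [], [], [], []
    for _ in range(n_components):
        wa = Xc.T @ (w * yc)             # weighted covariance direction
        wa /= np.linalg.norm(wa)
        t = Xc @ wa                      # scores
        tt = t @ (w * t)
        p = Xc.T @ (w * t) / tt          # X loadings
        q = yc @ (w * t) / tt            # y loading
        Xc = Xc - np.outer(t, p)         # deflate X and y
        yc = yc - q * t
        W_.append(wa); P_.append(p); Q_.append(q)
        Wm, Pm, qv = np.array(W_).T, np.array(P_).T, np.array(Q_)
        coefs.append(Wm @ np.linalg.solve(Pm.T @ Wm, qv))
    return x_mean, y_mean, coefs


def rsw_rmsep(X, y, max_components=10, n_repeats=500, threshold=0.25, seed=0):
    """Repeat random sample weighting and collect RMSEP for every model size.

    Returns an (n_repeats, max_components) array whose columns approximate the
    distribution of prediction errors for 1..max_components PLS components.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errors = np.full((n_repeats, max_components), np.nan)
    for r in range(n_repeats):
        w = rng.uniform(0.0, 1.0, size=n)    # random non-negative weights
        test = w < threshold                 # low-weight samples -> prediction set
        if test.sum() < 2:
            continue
        x_mean, y_mean, coefs = weighted_pls1(X, y, w, max_components)
        for a, b in enumerate(coefs):
            pred = y_mean + (X[test] - x_mean) @ b
            errors[r, a] = np.sqrt(np.mean((y[test] - pred) ** 2))
    return errors
```

Inspecting the columns of the returned error matrix (for example as box plots per component count) then yields the kind of prediction-error distribution the abstract refers to, rather than a single RMSEP minimum.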
