Abstract
Model selection is an important issue when constructing multivariate calibration models using methods based on latent variables (e.g. partial least squares regression and principal component regression). It is important to select an appropriate number of latent variables to build an accurate and precise calibration model. Inclusion of too few latent variables can result in a model that is inaccurate over the complete space of interest. Inclusion of too many latent variables can result in a model that produces noisy predictions through incorporation of low‐order latent variables that have little or no predictive value. Commonly used metrics for selecting the number of latent variables are based on the predicted error sum of squares (PRESS) obtained via cross‐validation. In this paper a new approach for selecting the number of latent variables is proposed. In this new approach the prediction errors of individual observations (obtained from cross‐validation) are compared across models incorporating varying numbers of latent variables. Based on these comparisons, non‐parametric statistical methods are used to select the simplest model (least number of latent variables) that provides prediction quality that is indistinguishable from that provided by more complex models. Unlike methods based on PRESS, this new approach is robust to the effects of anomalous observations. More generally, the same approach can be used to compare the performance of any models that are applied to the same data set where reference values are available. The proposed methodology is illustrated with an industrial example involving the prediction of gasoline octane numbers from near‐infrared spectra. Published in 2004 by John Wiley & Sons, Ltd.
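The selection idea described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the paper's procedure: the abstract does not name the non-parametric test, so a one-sided paired sign test is assumed here, and the function names (`sign_test_p`, `select_n_latent`), the significance level `alpha = 0.05`, and the toy error values are all hypothetical.

```python
from math import comb

def sign_test_p(err_simple, err_complex):
    """One-sided paired sign test (an assumed choice of non-parametric test).

    Returns the p-value for the alternative that the simpler model has
    larger absolute errors than the more complex one. Ties are dropped.
    """
    worse = sum(1 for s, c in zip(err_simple, err_complex) if abs(s) > abs(c))
    better = sum(1 for s, c in zip(err_simple, err_complex) if abs(s) < abs(c))
    n = worse + better
    if n == 0:
        return 1.0
    # P(X >= worse) for X ~ Binomial(n, 1/2)
    return sum(comb(n, i) for i in range(worse, n + 1)) / 2 ** n

def select_n_latent(cv_errors, alpha=0.05):
    """cv_errors[k] holds per-observation CV errors for k + 1 latent variables.

    Returns the smallest number of latent variables whose errors are not
    significantly worse than those of any more complex model.
    """
    if not cv_errors:
        raise ValueError("cv_errors must contain at least one model")
    n_models = len(cv_errors)
    for k in range(n_models):
        if all(sign_test_p(cv_errors[k], cv_errors[j]) > alpha
               for j in range(k + 1, n_models)):
            return k + 1

# Hypothetical toy data: 10 observations, 3 candidate models.
# The last observation is anomalous for the 2-variable model.
errors = [
    [1.0] * 10,                  # 1 latent variable: poor everywhere
    [0.1] * 9 + [5.0],           # 2 latent variables: good, one outlier
    [0.09, 0.11] * 4 + [0.09, 0.1],  # 3 latent variables: about as good
]
press = [sum(e * e for e in errs) for errs in errors]
chosen = select_n_latent(errors)
```

In this toy case the single anomalous observation inflates PRESS for the 2-variable model enough that minimum PRESS favours 3 latent variables, while the sign test, which only counts which model is closer on each observation, selects 2.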