Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares

Klaas Faber, Bruce R Kowalski

doi:10.1002/(sici)1099-128x(199705)11:3<181::aid-cem459>3.0.co;2-7

Abstract

Multivariate calibration aims to model the relation between a dependent variable, e.g. analyte concentration, and the measured independent variables, e.g. spectra, for complex mixtures. The model parameters are obtained in the form of a regression vector from calibration data by regression methods such as principal component regression (PCR) or partial least squares (PLS). Subsequently, this regression vector is used to predict the dependent variable for unknown mixtures. The validation of the obtained predictions is a crucial part of the procedure, i.e. together with the point estimate an interval estimate is desired. The associated prediction intervals can be constructed from the covariance matrix of the estimated regression vector. However, currently known expressions for PCR and PLS are derived within the classical regression framework, i.e. they only take the uncertainty in the dependent variable into account. This severely limits their capability for establishing realistic prediction intervals in practical situations. In this paper, expressions are derived using the method of error propagation that also account for the measurement errors in the independent variables. An exact linear relation is assumed between the dependent and independent variables. The obtained expressions are therefore valid for the classical errors-in-variables (EIV) model. In order to make the presentation reasonably self-contained, relevant expressions are reviewed for the classical regression model as well as the classical EIV model, especially for ordinary least squares (OLS). The consequences for the limit of detection, wavelength selection, sample selection and local modeling are discussed. Diagnostics are proposed to determine the adequacy of the approximations used in the derivations. Finally, PCR and PLS are so-called biased regression methods. Compared with OLS, they yield smaller variance at the expense of increased bias. It follows that bias may be an important ingredient of the obtained predictions. Therefore, considerable attention is paid to the quantification of bias, and new stopping rules for model selection in PCR and PLS are proposed. The theoretical ideas are illustrated by the analysis of real data taken from the literature (classical regression model) as well as simulated data (classical EIV model). © 1997 John Wiley & Sons, Ltd.
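
As a rough illustration of the classical-framework starting point mentioned in the abstract (uncertainty in the dependent variable only), the sketch below fits a PCR model and propagates a known reference-value variance into a prediction variance. It is not the paper's EIV expression: it assumes error-free, mean-centred spectra, ignores bias from discarded components, and treats `sigma2` as known. The function names `pcr_fit` and `pcr_predict` and all parameter names are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Fit PCR on mean-centred data via a truncated SVD (illustrative sketch)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Keep only the first n_components singular triplets.
    U_a = U[:, :n_components]
    s_a = s[:n_components]
    V_a = Vt[:n_components].T
    # Regression vector b = V_a S_a^{-1} U_a^T (y - mean(y)).
    b = V_a @ ((U_a.T @ (y - y_mean)) / s_a)
    return b, x_mean, y_mean, V_a, s_a

def pcr_predict(x_new, model, sigma2, n_cal):
    """Return the prediction and its variance under the classical model.

    Assumption: only y carries error, with known variance sigma2, so
    cov(b) = sigma2 * V_a S_a^{-2} V_a^T and
    var(y_hat) = sigma2 * (1/n_cal + h), with h the leverage of x_new.
    """
    b, x_mean, y_mean, V_a, s_a = model
    xc = x_new - x_mean
    y_hat = y_mean + xc @ b
    t = (xc @ V_a) / s_a      # scores scaled by the singular values
    h = t @ t                 # leverage of the new sample
    var_y_hat = sigma2 * (1.0 / n_cal + h)
    return y_hat, var_y_hat
```

Because the variance here grows only with the leverage of the new sample and the noise in the reference values, it cannot reflect noise in the spectra themselves, which is the gap the abstract says the error-propagation expressions are meant to close.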
