A novel approach for the pre-selection of wavelengths, to be used in combination with Partial Least Squares (PLS) or other multivariate regression techniques, is presented. This variable selection method uses the purity function, originally proposed in the SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) algorithm, to map out regions of potentially influential variables. The selected intervals are then tested individually in practical modeling and prediction, and an optimal subset of variables is obtained. The algorithm is simple and intuitive and does not rely on iterative variable searches. The method was tested on a set of infrared protein spectra in order to improve the quantitative determination of the fractions of two secondary structure elements, α-helices and β-strands (β-sheets), in the protein polypeptide chain. Results comparable to those obtained through interval PLS (iPLS), an exhaustive search-based algorithm, were achieved in this study. Our method was shown to be particularly beneficial in combination with weighting of the variables by their inverse standard deviation.
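As a rough illustration of the purity-based mapping step, the sketch below computes the standard first-order SIMPLISMA purity of each wavelength (standard deviation divided by the mean plus a small offset) and groups wavelengths above a cutoff into contiguous candidate intervals. The function names, the offset fraction, and the thresholding rule are assumptions made for this example; the abstract does not specify how the authors delimit intervals, so this should not be read as their exact procedure.

```python
import numpy as np

def purity(X, offset_fraction=0.05):
    """First-order SIMPLISMA-style purity per variable (column) of X.

    X has shape (n_samples, n_wavelengths). The offset damps the purity
    of low-intensity, noise-dominated variables; 5% of the largest mean
    is an assumed, commonly used order of magnitude.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    alpha = offset_fraction * mean.max()
    return std / (mean + alpha)

def candidate_intervals(p, threshold):
    """Return (start, stop) index pairs, stop exclusive, of contiguous
    runs where the purity p exceeds the threshold (hypothetical rule)."""
    mask = p > threshold
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the mask switches between False and True.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))
```

Each interval returned this way would then be evaluated on its own in a regression model such as PLS, for example by cross-validated prediction error, and the best-performing intervals retained.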