Using elastic net regression to perform spectrally relevant variable selection

Cannon Giglio, Steven D Brown

doi:10.1002/cem.3034

Abstract

Multivariate data such as spectra frequently contain uninformative measured variables, and removing them requires methods that select the informative variables. Partial least squares (PLS) regression may incorporate information from uninformative measured variables, so it is important to select variables before performing the PLS regression. Elastic net (EN) regression performs variable selection automatically. An EN regression can select groups of correlated variables or select either sparse or nonsparse sets of variables. However, the predictive performance of the EN regression can be significantly worse than that of competing 1‐step variable selection methods such as variable importance in projection (VIP). In the present work, the use of the EN to select variables, followed by conventional PLS regression on the selected variables (EN‐PLS), is investigated. Variable selection by EN‐PLS was compared with that from EN regression, sparse PLS regression, VIP, and selectivity ratio selection on 2 data sets of visible/near‐infrared spectra. In all cases, the wavelengths selected were compared with reference data. The variables selected by EN‐PLS offered advantages in interpretability and gave more robust prediction performance than those obtained from full‐spectrum PLS and the other variable selection methods.

This paper reports a method for variable selection that uses an EN regression prior to a second regression by PLS, a 2‐step method termed EN‐PLS. Variables selected by EN‐PLS are compared with variables selected by the EN regression, as well as by VIP, selectivity ratio, and sparse PLS regression, 3 commonly used methods for variable selection in chemometrics. EN‐PLS is shown to select variables that are more easily interpreted. In addition, EN‐PLS performed more robustly than a PLS regression on all variables, as well as reduced PLS regressions that used variables selected by either the sparse PLS regression algorithm or a VIP variable selection followed by PLS modeling.

Full Text