Variable selection of spectroscopic data through monitoring both location and dispersion of PLS loading weights

Tahir Mehmood, Arslan Munir Turk

doi:10.1007/s42952-020-00098-x

Abstract

High-dimensional data sets with a small sample size are encountered in most of the sciences. Variable selection contributes to a better prediction of real-life phenomena. A multivariate approach called partial least squares (PLS) has the potential to model high-dimensional data, where the sample size is usually smaller than the number of variables. Truncation for variable selection in PLS ( $$T-PLS$$ ) is considered a reference method. $$T-PLS$$ and many other methods monitor only the location of the PLS loading weights for variable selection. In the current article, we propose to monitor both the location and the dispersion of the PLS loading weights for variable selection over high-dimensional spectral data. The proposed PLS variants are based on monitoring the location, the dispersion, both location and dispersion, and at least one of location or dispersion of the $$PLS$$ loading weights, and are denoted by $$X-PLS$$ , $$S-PLS$$ , $$X \& S-PLS$$ and $$X|S-PLS$$ respectively. The proposed PLS variants are compared with standard PLS and $$T-PLS$$ through a Monte Carlo simulation of 100 runs on simulated and real data sets, the latter including corn, milk, and oil content prediction based on spectroscopic data. $$X \& S-PLS$$ shows the best capability in selecting the real variables over the simulated data. The validated RMSE comparison indicates that $$X|S-PLS$$ and $$X \& S-PLS$$ outperform the other methods in predicting corn, milk, and oil contents. $$X \& S-PLS$$ selects the smallest number of variables. Interestingly, the variables selected by $$X \& S-PLS$$ are more consistent than those selected by all other methods. Hence $$X \& S-PLS$$ appears to be a potential candidate for variable selection in high-dimensional data.
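The way the four variants combine the two monitoring rules can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_variables`, the use of the mean and standard deviation of absolute loading weights as the location and dispersion statistics, the fixed cut-offs `loc_cut` and `disp_cut`, and the direction of the comparisons are all assumptions made for illustration. Only the combination logic (location only, dispersion only, both, at least one) follows the description above.

```python
from statistics import mean, stdev


def select_variables(weights, loc_cut, disp_cut):
    """Flag variables by location and dispersion of their loading weights.

    weights  : list with one entry per variable; each entry is a list of that
               variable's loading weights (e.g. one per PLS component).
    loc_cut  : hypothetical cut-off on the mean absolute weight (assumption).
    disp_cut : hypothetical cut-off on the standard deviation (assumption).
    """
    loc_flag, disp_flag = set(), set()
    for j, w in enumerate(weights):
        abs_w = [abs(v) for v in w]
        if mean(abs_w) > loc_cut:      # location monitoring
            loc_flag.add(j)
        if stdev(abs_w) > disp_cut:    # dispersion monitoring
            disp_flag.add(j)
    return {
        "X-PLS": loc_flag,                # location only
        "S-PLS": disp_flag,               # dispersion only
        "X&S-PLS": loc_flag & disp_flag,  # both rules must flag the variable
        "X|S-PLS": loc_flag | disp_flag,  # at least one rule flags it
    }


# Toy loading weights for 4 variables over 3 components (made-up numbers).
toy_weights = [
    [0.9, 0.8, 0.7],
    [0.1, 0.1, 0.1],
    [0.05, 0.6, 0.05],
    [0.7, 0.1, 0.9],
]
selected = select_variables(toy_weights, loc_cut=0.3, disp_cut=0.2)
```

By construction, the intersection rule can never select more variables than either single rule, and the union rule can never select fewer, which is consistent with $$X \& S-PLS$$ being reported as selecting the smallest number of variables.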

Full Text