Abstract
Feature selection, which is important for successful analysis of chemometric data, aims to produce parsimonious and predictive models. Partial least squares (PLS) regression is one of the main methods in chemometrics for analyzing multivariate data with input X and response Y by modeling the covariance structure in the X and Y spaces. Recently, orthogonal projections to latent structures (OPLS) has been widely used in processing multivariate data because OPLS improves the interpretability of PLS models by removing systematic variation in the X space that is not correlated to Y. The purpose of this paper is to present a feature selection method for multivariate data based on orthogonal PLS regression (OPLSR), which combines orthogonal signal correction with PLS. The presented method generates empirical distributions of the effects of features upon Y in OPLSR vectors via permutation tests and examines the significance of the effects of the input features on Y. We show the performance of the proposed method, in comparison with the false discovery rate method, using a simulation study in which a three-layer network structure exists. To demonstrate this method, we apply it to both real-life NIR spectra data and mass spectrometry data.
Highlights
Feature selection is a technique to select a subset of variables which are useful in predicting target responses
Contemporary analytic methods such as near-infrared (NIR), proton nuclear magnetic resonance (1H NMR) spectroscopy, liquid chromatography-mass spectrometry (LC-MS), and gas chromatography-mass spectrometry (GC-MS) provide high-dimensional data sets in which the number of features is usually larger than the number of observations
We summarize the basic steps as follows and link them to partial least squares (PLS) to obtain orthogonal-signal-corrected PLS regression (OPLSR) vectors that will be used for feature selection
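To make the idea of an orthogonal-signal-corrected PLS regression vector concrete, the following is a minimal sketch for a single response. It assumes the commonly described OPLS filtering step (remove one component of X that is orthogonal to Y, then fit a one-component PLS model on the filtered data). The function names `opls_filter_once` and `pls1_one_component` are hypothetical, the data are synthetic, and the paper's own procedure may differ in details such as the number of components.

```python
import numpy as np

def opls_filter_once(X, y):
    """Remove one Y-orthogonal component from column-centered X (single response y)."""
    w = X.T @ y
    w = w / np.linalg.norm(w)              # predictive weight, proportional to X'y
    t = X @ w                              # predictive score
    p = X.T @ t / (t @ t)                  # predictive loading
    w_orth = p - (w @ p) * w               # part of the loading orthogonal to w
    w_orth = w_orth / np.linalg.norm(w_orth)
    t_orth = X @ w_orth                    # orthogonal score (uncorrelated with y)
    p_orth = X.T @ t_orth / (t_orth @ t_orth)
    X_filt = X - np.outer(t_orth, p_orth)  # deflate X by the orthogonal component
    return X_filt, t_orth, p_orth

def pls1_one_component(X, y):
    """One-component PLS1 regression vector on centered data."""
    w = X.T @ y
    w = w / np.linalg.norm(w)
    t = X @ w
    q = (t @ y) / (t @ t)                  # regression of y on the score
    return w * q

# Synthetic, centered data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
X -= X.mean(axis=0)
y = X[:, 0] + 0.1 * rng.normal(size=30)
y -= y.mean()

X_filt, t_orth, p_orth = opls_filter_once(X, y)
beta = pls1_one_component(X_filt, y)       # regression vector after filtering
```

The orthogonal score `t_orth` is uncorrelated with `y` by construction, which is what distinguishes the removed variation from the predictive part.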
Summary
Feature selection is a technique to select a subset of variables which are useful in predicting target responses. Application of the OPLSR method to the data matrix X of size 127 × 7068 and response Y of size 127 × 1 yielded the empirical null distribution of OPLS regression vector βOPLSb. Figure 3 shows two-sided 95% confidence intervals (red rectangles) and significant m/z variables (black circles) identified by permutation testing of OPLS regression. This is not surprising, because the proposed method examines the effects of variables in a collective manner by considering the covariance structure and the effects of both orthogonal and predictive components at the same time. For the down-sampled data, Table 4 presents the performance comparison for the four methods, demonstrating the degree of robustness.
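The permutation step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a plain one-component PLS1 regression vector (`pls1_vector`, a hypothetical helper) stands in for the OPLSR vector, the data are synthetic, and the number of permutations `n_perm` is chosen arbitrarily. A variable is flagged when its observed coefficient falls outside the two-sided 95% interval of its empirical null distribution.

```python
import numpy as np

def pls1_vector(X, y):
    """One-component PLS1 regression vector (stand-in for the OPLSR vector)."""
    w = X.T @ y
    w = w / np.linalg.norm(w)
    t = X @ w
    q = (t @ y) / (t @ t)
    return w * q

def permutation_select(X, y, coef_fn, n_perm=500, alpha=0.05, seed=0):
    """Flag variables whose coefficient lies outside the permutation null interval."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = coef_fn(Xc, yc)                      # observed regression vector
    null = np.empty((n_perm, X.shape[1]))
    for b in range(n_perm):
        # Permuting y breaks the X-y association while keeping X's covariance
        null[b] = coef_fn(Xc, rng.permutation(yc))
    lo, hi = np.quantile(null, [alpha / 2, 1 - alpha / 2], axis=0)
    selected = (beta < lo) | (beta > hi)        # two-sided test per variable
    return beta, lo, hi, selected

# Synthetic example: only the first two variables affect y
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * rng.normal(size=100)

beta, lo, hi, selected = permutation_select(X, y, pls1_vector)
print(np.flatnonzero(selected))                 # indices of flagged variables
```

Because each permutation refits the whole regression vector, the null intervals reflect the covariance structure of X rather than treating each variable in isolation.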