Variable importance: Comparison of selectivity ratio and significance multivariate correlation for interpretation of latent‐variable regression models

Olav M Kvalheim

doi:10.1002/cem.3211

Abstract

This work examines the performance of significance multivariate correlation (sMC) and selectivity ratio (SR) for ranking variables according to their importance in latent‐variable regression (LVR) models. Both indices are based on target projection (TP) of a validated LVR model obtained by partial least squares (PLS). The matrix of explanatory x‐variables is projected on the normalized regression vector to obtain a score vector that is proportional to the vector of predicted values for the response variable y. sMC for each x‐variable is calculated as the ratio of the squared variance explained by the decomposition obtained from these two vectors to the squared residuals. SR is calculated in a similar way, except that the regression vector is replaced by the loading vector obtained by projecting the data matrix of x‐variables onto the score vector obtained by TP. The two indices for variable importance are compared for three different applications with data representing instrumental profiles from liquid chromatography, infrared spectroscopy, and proton nuclear magnetic resonance spectroscopy. Results show that SR outperforms sMC for interpretation and biomarker selection. The main drawback of sMC appears to be the mixing of predictive and orthogonal variation that results from using the normalized regression vector directly in the calculation. SR instead uses a loading vector that is proportional to the covariances between the x‐variables and the predicted response variable.
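The two decompositions described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: the function name `target_projection_indices` is hypothetical, the regression vector `b` is obtained here by ordinary least squares as a stand-in for a validated PLS model, and any degrees-of-freedom scaling or significance testing used in the published definitions of sMC and SR is omitted.

```python
import numpy as np

def target_projection_indices(X, b):
    """Return (SR, sMC-like ratio) for each column of X, given regression vector b.

    Assumption: both indices are plain explained/residual sum-of-squares
    ratios; published definitions may add degrees-of-freedom scaling.
    """
    Xc = X - X.mean(axis=0)          # column-center the explanatory variables
    w = b / np.linalg.norm(b)        # normalized regression vector
    t = Xc @ w                       # TP scores, proportional to the predicted y
    p = Xc.T @ t / (t @ t)           # TP loadings: Xc projected onto the scores

    def ratio(X_hat):
        explained = np.sum(X_hat ** 2, axis=0)
        residual = np.sum((Xc - X_hat) ** 2, axis=0)
        return explained / residual

    sr = ratio(np.outer(t, p))       # SR: scores combined with loadings
    smc = ratio(np.outer(t, w))      # sMC: scores combined with regression vector
    return sr, smc

# Synthetic example: only the first variable drives the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)

# Stand-in for a validated PLS model: least-squares regression vector.
Xc = X - X.mean(axis=0)
b, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)

sr, smc = target_projection_indices(X, b)
```

The only difference between the two ratios in this sketch is the vector paired with the scores: the loadings `p` for SR and the normalized regression vector `w` for sMC, which is the distinction the abstract identifies as the source of their different behavior.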

Highlights

Measures for variable importance are crucial for interpretation and biomarker selection using partial least squares (PLS)[1] or any other method based on latent‐variable regression (LVR) modeling.[2,3]
The first application area usually involves a continuous response variable measuring some kind of bioactivity of the total extract, while disease patterns are typically revealed by using a binary response variable describing the condition as healthy or ill, an approach known as PLS discriminant analysis (PLS‐DA).
The nuclear magnetic resonance (NMR) profiles were aligned to the lactate doublet at approximately 1.32 ppm, and the shift regions embracing the lipoprotein methylene peak, 1.30 to 1.19 ppm, the lipoprotein methyl peak, 0.90 to 0.78 ppm, and the peak located at 0.70 to 0.62 ppm were selected as explanatory variables.

Summary

| INTRODUCTION

Measures for variable importance are crucial for interpretation and biomarker selection using partial least squares (PLS)[1] or any other method based on latent‐variable regression (LVR) modeling.[2,3] Many measures and visualizations[4].

Proton nuclear magnetic resonance (NMR) spectroscopy using nuclear Overhauser effect spectroscopy (NOESY) was performed on the same samples according to a procedure described in previous work.[21] The NMR profiles were aligned to the lactate doublet at approximately 1.32 ppm, and the shift regions embracing the lipoprotein methylene peak, 1.30 to 1.19 ppm, the lipoprotein methyl peak, 0.90 to 0.78 ppm, and the peak located at 0.70 to 0.62 ppm were selected as explanatory variables. These regions are known to contain quantitative information about TC, LDL‐C, HDL‐C, and TG.

We have multiplied sMC for each explanatory variable by the sign of the corresponding loading to simplify visual comparison of the two indices.

| RESULTS

| DISCUSSION

Findings

| CONCLUSION