Background

PLS-DA of high-dimensional metabolomics data is frequently used to identify the features most pertinent to sample classification. However, the presence of numerous insignificant input features can distort the PLS-DA model and both inflate and scramble the set of selected differential features. Univariate filtration is usually applied afterward to refine the selected features, but it often gives unstable results. In contrast, when insignificant features are excluded beforehand by univariate data prefiltration based on FDR-adjusted p-values, PLS-DA can generate more stable and reliable differential features. We explored and compared these two data analysis procedures to gain insight into the mechanisms responsible for the disparate results.

Results

The effect of univariate data filtration before and after PLS-DA on the identified discriminative features/metabolites was investigated using LC-MS data acquired from human serum samples and C. elegans extracts, with and without spiked metabolite standards to simulate the treated and control groups of biological samples. Univariate data prefiltration before PLS-DA usually gave fewer but more stable, and likely more reliable and meaningful, differential features, whereas PLS-DA applied directly to the original data could be affected by the presence of insignificant features and orthogonal noise. A large number of insignificant variables and orthogonal noise could distort the generated PLS-DA model and affect the p(corr) values. They could also artificially inflate the calculated VIP values of relevant features because of the increased total number of input features used for model construction, leading to more false positives being selected at the conventional VIP threshold of 1.0.

Significance and novelty

Univariate data filtration preceding PLS-DA was important for identifying reliable differential features when the conventional VIP threshold of 1.0 was used.
The presence of insignificant features could distort the PLS-DA model and inflate VIP values. The appropriate VIP threshold depends on the number of input features and the number of model components. For PLS-DA without univariate prefiltration, a VIP threshold larger than 1.0 is recommended for selecting discriminative features, in order to reduce false positives.
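The link between the VIP threshold and the number of input features can be illustrated with the commonly used VIP definition, VIP_j = sqrt(p * sum_a(SSY_a * (w_ja / ||w_a||)^2) / sum_a(SSY_a)), under which the mean of the squared VIP values is always 1. The sketch below is not the authors' code: the function name `vip_scores`, its arguments, and all weight values are hypothetical, and it makes the simplifying assumption that the weights of the original features stay fixed when noise features are added (in a fitted model they would be re-estimated).

```python
import math


def vip_scores(weights, ssy):
    """Compute VIP scores from PLS weights.

    weights[a][j] is the weight of feature j in component a;
    ssy[a] is the Y sum of squares explained by component a.
    Both inputs are assumed to come from an already fitted model.
    """
    p = len(weights[0])
    total_ssy = sum(ssy)
    scores = []
    for j in range(p):
        acc = 0.0
        for w_a, ss_a in zip(weights, ssy):
            norm_sq = sum(w * w for w in w_a)
            acc += ss_a * (w_a[j] ** 2) / norm_sq
        scores.append(math.sqrt(p * acc / total_ssy))
    return scores


# Hypothetical one-component model with four features:
# two strongly weighted and two weakly weighted.
base_weights = [0.6, 0.5, 0.3, 0.3]
small = vip_scores([base_weights], [1.0])

# Same four features plus 16 near-irrelevant features with tiny weights.
large = vip_scores([base_weights + [0.05] * 16], [1.0])

# The weakly weighted feature at index 2 is below 1.0 in the small model
# but above 1.0 once the extra features are included.
print(small[2], large[2])
```

Because the squared VIP values must average to 1 across all input features, every near-zero-weight feature added to the model pushes the VIP of the remaining features upward, which is consistent with the recommendation of a threshold above 1.0 when no prefiltration is applied.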
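The univariate prefiltration step described in the abstract relies on FDR-adjusted p-values. The abstract does not say which FDR procedure was used; the sketch below assumes the Benjamini-Hochberg adjustment, and the function name `benjamini_hochberg`, the example p-values, and the 0.05 cutoff are all hypothetical choices for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Feature indices sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from per-feature univariate tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)

# Keep only features whose adjusted p-value passes an assumed 0.05 cutoff;
# these would then be passed on to PLS-DA.
kept = [i for i, q in enumerate(adjusted) if q < 0.05]
print(kept)
```

In this toy example, two features with raw p-values just under 0.05 no longer pass after adjustment, so they would be excluded before the PLS-DA model is built.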