Abstract
Background: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance with that of its close relative, Principal Component Analysis (PCA), from which it was originally derived.
Results: We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from examining the signal-to-noise ratio in the feature selection task to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed an interesting data set from 396 vaginal microbiome samples where the ground truth for the feature selection was available. All the 3D figures shown in this paper, as well as the supplementary ones, can be viewed interactively at http://biorg.cs.fiu.edu/plsda.
Conclusions: Our results highlighted the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.
Highlights
Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier
We discuss a variety of experiments with synthetic and real data that help explain the strengths and weaknesses of PLS-DA vis-à-vis Principal Component Analysis (PCA) and other tools
We found that the PCA-based algorithms (PCA and Sparse Principal Component Analysis (SPCA)) have similar overall performance across the three experiments
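The kind of comparison described above, a supervised and an unsupervised feature selector applied to synthetic data with a known ground truth, can be illustrated with a minimal sketch. This is not the authors' experimental setup. It assumes two balanced classes, a mean shift on the first three features only, and ranking by the first component alone. The PLS-DA side is approximated by the first PLS weight vector for a 0/1-coded label, which is proportional to the centered X transposed times the centered y. The PCA side uses the leading eigenvector found by power iteration. All function names and parameter values are hypothetical, and the code is kept dependency-free on purpose.

```python
import random


def make_data(n=200, p=20, n_signal=3, shift=3.0, seed=0):
    """Two balanced classes; only the first n_signal features differ in mean (assumed model)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in range(n_signal):
            row[j] += shift * label
        X.append(row)
        y.append(label)
    return X, y


def center_columns(X):
    """Subtract each column's mean."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def pls_first_weights(X, y):
    """First PLS weight vector for one response: proportional to Xc^T yc (uses labels)."""
    Xc = center_columns(X)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    p = len(Xc[0])
    return [sum(Xc[i][j] * yc[i] for i in range(len(Xc))) for j in range(p)]


def pca_first_loadings(X, n_iter=100):
    """Leading eigenvector of Xc^T Xc by power iteration (labels are never seen)."""
    Xc = center_columns(X)
    p = len(Xc[0])
    v = [1.0] * p
    for _ in range(n_iter):
        scores = [sum(row[j] * v[j] for j in range(p)) for row in Xc]  # Xc v
        v = [sum(Xc[i][j] * scores[i] for i in range(len(Xc))) for j in range(p)]  # Xc^T (Xc v)
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return v


def top_k(weights, k):
    """Indices of the k features with the largest absolute weight, sorted ascending."""
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return sorted(order[:k])


X, y = make_data()
print("PLS-style selection:", top_k(pls_first_weights(X, y), 3))
print("PCA selection:      ", top_k(pca_first_loadings(X), 3))
```

In this particular setting the class shift also dominates the variance of the signal features, which is why an unsupervised ranking has a chance of recovering them. With a weaker shift or noisier features the two rankings may diverge.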
Summary
Partial Least-Squares Discriminant Analysis (PLS-DA) is a multivariate dimensionality-reduction tool [1, 2] that has been popular in the field of chemometrics for well over two decades [3], and has been recommended for use in omics data analyses. PLS-DA is gaining popularity in metabolomics and in other integrative omics analyses [4,5,6]. Both chemometrics and omics data sets are characterized by large volume, a large number of features, noise, and missing data [2, 7]. Besides its use for dimensionality reduction, it can be adapted