Abstract

Background: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits.

Results: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework.

Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.

Highlights

  • Introduction to PLS Discriminant Analysis: although Partial Least Squares regression (PLS) [13] was not originally designed for classification and discrimination problems, it has often been used for that purpose [38,51]

  • Results and Discussion: we compare our proposed sparse Partial Least Squares Discriminant Analysis (sPLS-DA) approach with other sparse exploratory approaches, namely two sparse Linear Discriminant Analyses (LDA) proposed by [41] and three other versions of sparse PLS from [30]

  • We first discuss the choice of the number of dimensions H in sPLS-DA, then the classification performance obtained with the tested approaches, and finally the computational time required by the exploratory approaches


Summary

Introduction

Although Partial Least Squares (PLS) [13] was not originally designed for classification and discrimination problems, it has often been used for that purpose [38,51]. In a supervised classification framework, one solution is to reduce the dimensionality of the data, either by performing feature selection or by introducing artificial variables that summarize most of the information. For this purpose, many approaches have been proposed in the microarray literature. Other approaches have been used for exploratory purposes and to give more insight into biological studies. This is the case of Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA, see [11,12] for a supervised version) and Partial Least Squares regression (PLS, [13], see [14,15,16] for discrimination purposes), which aim to explain most of the variance/covariance structure of the data using linear combinations of the original variables. A limitation of the approaches cited above is the lack of interpretability when dealing with a large number of variables.
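
To make the idea of using PLS for discrimination with variable selection more concrete, the following is a minimal sketch in Python with NumPy. It is not the mixOmics implementation and not necessarily the authors' exact algorithm. It assumes that class labels are coded as an indicator (dummy) matrix, that the first loading vector is taken from the singular value decomposition of the cross-product between the centered data and the centered indicator matrix, and that sparsity is obtained by a single soft-thresholding step. The function names `dummy_code`, `soft_threshold` and `sparse_plsda_component`, and the parameter `keep`, are hypothetical and chosen for illustration only.

```python
import numpy as np


def dummy_code(labels):
    """Code a 1-D array of class labels as an n x K indicator matrix."""
    labels = np.asarray(labels)
    classes = np.unique(labels)                 # sorted unique class names
    idx = np.searchsorted(classes, labels)      # column index of each sample
    Y = np.zeros((labels.shape[0], classes.shape[0]))
    Y[np.arange(labels.shape[0]), idx] = 1.0
    return Y, classes


def soft_threshold(v, lam):
    """Shrink each entry of v towards zero by lam; small entries become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def sparse_plsda_component(X, labels, keep):
    """Illustrative first sparse component (assumption: one thresholding step).

    Returns a unit-norm loading vector with at most `keep` non-zero entries
    and the corresponding latent variable (sample scores).
    """
    X = np.asarray(X, dtype=float)
    Y, _ = dummy_code(labels)
    Xc = X - X.mean(axis=0)                     # center variables
    Yc = Y - Y.mean(axis=0)                     # center indicator columns
    M = Xc.T @ Yc                               # p x K cross-product matrix
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    u = U[:, 0]                                 # dense first loading vector
    # Use the (keep + 1)-th largest absolute loading as the threshold,
    # so that (barring ties) exactly `keep` loadings survive.
    abs_sorted = np.sort(np.abs(u))[::-1]
    lam = abs_sorted[keep] if keep < u.shape[0] else 0.0
    u_sparse = soft_threshold(u, lam)
    norm = np.linalg.norm(u_sparse)
    if norm > 0:
        u_sparse = u_sparse / norm
    t = Xc @ u_sparse                           # latent variable (scores)
    return u_sparse, t
```

Calling `sparse_plsda_component(X, labels, keep=5)` on an n x p data matrix returns a loading vector in which only the selected variables have non-zero weights, which is what makes the resulting component easier to interpret than a dense linear combination of all p variables. Further components (up to the H dimensions discussed in the highlights) would require a deflation step that this sketch omits.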

