Abstract
Genomics-based technologies produce large amounts of data. To interpret the results and identify the most important variates related to phenotypes of interest, various multivariate regression and variate selection methods are used. Although inspected for statistical performance, the relevance of multivariate models in interpreting biological data sets often remains elusive. We compare various multivariate regression and variate selection methods applied to a nutrigenomics data set in terms of performance, utility and biological interpretability. The studied data set comprised hepatic transcriptome (10,072 predictor variates) and plasma protein concentrations [2 dependent variates: Leptin (LEP) and Tissue inhibitor of metalloproteinase 1 (TIMP-1)] collected during a high-fat diet study in ApoE3Leiden mice. The multivariate regression methods used were: partial least squares “PLS”; a genetic algorithm-based multiple linear regression, “GA-MLR”; two least-angle shrinkage methods, “LASSO” and “ELASTIC NET”; and a variant of PLS that uses covariance-based variate selection, “CovProc.” Two methods of ranking the genes for Gene Set Enrichment Analysis (GSEA) were also investigated: either by their correlation with the protein data or by the stability of the PLS regression coefficients. The regression methods performed similarly, with CovProc and GA performing the best and worst, respectively (R-squared values based on “double cross-validation” predictions of 0.762 and 0.451 for LEP; and 0.701 and 0.482 for TIMP-1). CovProc, LASSO and ELASTIC NET all produced parsimonious regression models and consistently identified small subsets of variates, with high commonality between the methods. Comparison of the gene ranking approaches found a high degree of agreement, with PLS-based ranking finding fewer significant gene sets. 
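The R-squared values quoted above are based on "double cross-validation" predictions. As a rough illustration of that idea, and not the authors' implementation, the sketch below nests an inner cross-validation loop, which picks a shrinkage penalty using training data only, inside an outer loop whose held-out predictions are pooled to compute R-squared. The one-predictor shrunken regression, the penalty grid, the fold counts, the function names and the toy data are all assumptions made for brevity; the study itself applied PLS, GA-MLR, LASSO, ELASTIC NET and CovProc to 10,072 predictor variates.

```python
# Minimal sketch of double (nested) cross-validation.
# All function names, the penalty grid and the toy data are hypothetical.

def fit(xs, ys, lam):
    """Fit y = a + b*x with the slope shrunk by penalty lam (ridge-like)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / (sxx + lam)
    return my - b * mx, b

def folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]

def cv_predict(xs, ys, k, choose_lam):
    """Predict every sample from a model that never saw that sample."""
    preds = [0.0] * len(xs)
    for test in folds(len(xs), k):
        held = set(test)
        train = [i for i in range(len(xs)) if i not in held]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        a, b = fit(tx, ty, choose_lam(tx, ty))
        for i in test:
            preds[i] = a + b * xs[i]
    return preds

def sse(ys, preds):
    """Sum of squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

def inner_select(tx, ty, grid=(0.0, 1.0, 10.0), k=3):
    """Inner loop: pick the penalty with the lowest CV error on training data only."""
    return min(grid, key=lambda lam: sse(ty, cv_predict(tx, ty, k, lambda *_: lam)))

def double_cv_r2(xs, ys, k_outer=5):
    """Outer loop: R-squared computed from pooled held-out predictions."""
    preds = cv_predict(xs, ys, k_outer, inner_select)
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - sse(ys, preds) / ss_tot

# Toy data (invented): a noisy linear relationship between one variate and a response.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + (1.0 if i % 2 else -1.0) for i, x in enumerate(xs)]
r2 = double_cv_r2(xs, ys)
```

Because the penalty is tuned only inside each outer training set, the outer predictions are not influenced by the samples they are scored on, which is what makes the resulting R-squared a fair basis for comparing methods.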
We recommend the use of CovProc for variate selection, in tandem with univariate methods, and the use of correlation-based ranking for GSEA-like pathway analysis methods.
Electronic supplementary material
The online version of this article (doi:10.1007/s12263-012-0288-4) contains supplementary material, which is available to authorized users.
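Correlation-based ranking of genes against a protein measurement, as discussed in the abstract, can be illustrated with a short example. This is a minimal sketch and not the authors' pipeline: it assumes Pearson correlation as the ranking metric, uses a hypothetical `rank_genes_by_correlation` helper, and runs on invented toy expression values. The ordered list it produces is the kind of input a GSEA-like tool would consume.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant variate: treated as uncorrelated (an assumption)
    return cov / (sx * sy)

def rank_genes_by_correlation(expression, protein):
    """Rank genes by signed correlation with a protein concentration.

    expression: dict mapping gene name -> expression values, one per sample
    protein:    protein concentrations, one per sample, in the same order
    Returns (gene, correlation) pairs, most positively correlated first.
    """
    scores = {gene: pearson(values, protein) for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (invented): three genes measured in five animals, one plasma protein.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [5.0, 4.0, 3.0, 2.0, 1.0],
    "gene_c": [2.0, 1.0, 2.0, 1.0, 2.0],
}
protein = [1.1, 1.9, 3.2, 3.9, 5.0]
ranking = rank_genes_by_correlation(expression, protein)
```

Keeping the sign of the correlation places positively and negatively associated genes at opposite ends of the list, which is what a rank-based enrichment statistic relies on.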
Highlights
In many life science studies, large data sets are generated from metabolomics, proteomics and transcriptomics experiments
An ideal variate selection method has principles and parameters that are well-suited to the particular study goal and/or to the data characteristics, but it is not always straightforward to make these choices in advance
This study compared five methods currently used for variate selection or ranking: partial least squares (PLS), genetic algorithm (GA), least absolute shrinkage and selection operator (LASSO), ELASTIC NET and CovProc
Summary
In many life science studies, large data sets are generated from metabolomics, proteomics and transcriptomics experiments. These data are then used to identify biomarkers or crucial pathways associated with the original study goal. Statistical models are generated that describe the relationship between the genomics data and some feature of interest (e.g., a phenotype). Many variate selection methods are described in the literature. These can differ in their implementation details or in their fundamental statistical principles (Guyon and Elisseeff 2003; Guyon et al. 2006). An ideal variate selection method has principles and parameters that are well-suited to the particular study goal and/or to the data characteristics, but it is not always straightforward to make these choices in advance. Even though the statistical principles of a method may be understood, its utility from a biological perspective is often less obvious.
Genes Nutr (2012) 7:387–397