Group sparse canonical correlation analysis for genomic data integration.

Dongdong Lin, Vince D Calhoun, Hong-Wen Deng, Yu-Ping Wang, Jingyao Li, Jigang Zhang

doi:10.1186/1471-2105-14-245

Abstract

Background: The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. Exploring the relationships between these different types of genomic data sets is challenging. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. Conventional CCA does not work effectively if the number of samples is significantly smaller than the number of biomarkers, which is typical for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalization with the l-1 norm (CCA-l1) or a combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group).

Results: We propose a new group sparse CCA method (CCA-sparse group), along with an effective numerical algorithm, to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features.

Conclusions: The CCA-sparse group method incorporates group effects of features into the correlation analysis while performing individual feature selection simultaneously. On the simulated data, it outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level, even if the group effect does not exist or irrelevant features are grouped with truly correlated features. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features.
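The combination of a group penalty with individual feature selection described in the abstract can be illustrated with the standard sparse group thresholding step. The sketch below is a minimal illustration, assuming a penalty of the form `lam_l1 * ||u||_1 + lam_group * sum_g ||u_g||_2`; the function names, the toy vector `z`, and the penalty weights are hypothetical, and this is not claimed to be the exact update used in the paper's algorithm.

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise soft-thresholding: shrink each entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_group_threshold(z, groups, lam_l1, lam_group):
    """Proximal step for lam_l1*||u||_1 + lam_group*sum_g ||u_g||_2.

    z      : 1-D array of unpenalized loadings (assumed input)
    groups : 1-D integer array giving the group label of each feature
    """
    # Step 1: l-1 shrinkage selects individual features within groups.
    u = soft_threshold(z, lam_l1)
    # Step 2: group-wise shrinkage removes whole groups with small norm.
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(u[idx])
        if norm <= lam_group:
            u[idx] = 0.0
        else:
            u[idx] *= 1.0 - lam_group / norm
    return u

# Toy example: two groups of three features (e.g., SNPs within two genes).
z = np.array([3.0, 0.3, -2.0, 0.6, -0.5, 0.45])
groups = np.array([0, 0, 0, 1, 1, 1])
u = sparse_group_threshold(z, groups, lam_l1=0.4, lam_group=0.5)
print(u)
```

In this toy run the weak second feature of group 0 is dropped by the l-1 step, while all of group 1 is dropped by the group step, which is the qualitative behaviour the abstract attributes to combining both penalties.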

Highlights

The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases
In this paper, we propose a group sparse canonical correlation analysis (CCA) model to explore the correlation between two different types of genomic data
Our algorithm for solving CCA-l1 is very similar to that in [7] and we show that the algorithms for CCA-l1 and CCA-elastic net will converge to the same solution under a particular condition

Summary

Introduction

The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. The development of a variety of affordable high-throughput genome-wide assays enables multiple measurements of genomic markers from different platforms and/or scales for the same subject, e.g., gene expression, single nucleotide polymorphisms (SNP), copy number variation, and proteomic data. The co-expressed and coregulated genes and their associated SNPs. Unlike regression-based integrative methods (i.e., with principal component analysis and PLS), CCA focuses on the canonical correlation framework and requires no prior knowledge of which type of omic data is explained or regressed by the other (e.g., with transcripts and metabolites). This high dimensionality can result in multi-collinearity (linear dependence) problems and computational difficulty [7].
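The high-dimensionality problem noted above can be made concrete with a small numerical sketch. It computes sample canonical correlations for independent random data in two regimes; the helper name `canonical_correlations`, the sample sizes, and the tolerance are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

def canonical_correlations(X, Y, tol=1e-10):
    """Sample canonical correlations between the columns of X and Y.

    Computed as the singular values of Qx.T @ Qy, where Qx and Qy are
    orthonormal bases of the column spaces of the centered data.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    def basis(A):
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        return U[:, s > tol * s.max()]  # keep numerically nonzero directions

    Qx, Qy = basis(Xc), basis(Yc)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

rng = np.random.default_rng(0)

# Fewer samples than features: X and Y are independent noise, yet the
# leading sample canonical correlation is (numerically) 1.
rho_wide = canonical_correlations(rng.standard_normal((20, 50)),
                                  rng.standard_normal((20, 40)))

# Many more samples than features: the estimate is close to 0, as expected.
rho_tall = canonical_correlations(rng.standard_normal((2000, 3)),
                                  rng.standard_normal((2000, 2)))

print(rho_wide[0], rho_tall[0])
```

With fewer samples than features, the centered data in each set span the same subspace, so unpenalized CCA reports perfect correlation regardless of any true relationship. This is one way to see why sparsity-inducing penalties are introduced in this setting.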

Journal: BMC Bioinformatics	Publication Date: Aug 12, 2013
Citations: 144	License type: cc-by
