MODMatcher: multi-omics data matcher for integrative genomic analysis.

Seungyeul Yoo, Mark W Geraci, Eric E Schadt, Zhidong Tu, Jun Zhu, Avrum Spira, Joshua D Campbell, Tao Huang, Eunjee Lee, Charles A Powell

doi:10.1371/journal.pcbi.1003790

Open Access

https://doi.org/10.1371/journal.pcbi.1003790

Abstract

Errors in sample annotation or labeling often occur in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. For integrative genomic studies, it is critical to identify and correct these errors. Different types of genetic and genomic data are inter-connected by cis-regulations. On that basis, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, which can be used in further integrative analysis. Our results indicate that inspection of sample annotation and labeling error is an indispensable data quality assurance step. Applied to a large lung genomic study, MODMatcher increased statistically significant genetic associations and genomic correlations by more than two-fold. In a simulation study, MODMatcher provided more robust results by using three types of omics data than two types of omics data. We further demonstrate that MODMatcher can be broadly applied to large genomic data sets containing multiple types of omics data, such as The Cancer Genome Atlas (TCGA) data sets.

Highlights

Cells employ multiple levels of regulation that enable them to respond to genetic, epigenetic, genomic, and environmental perturbations.
Many human diseases are complex, with multiple genetic and environmental causal factors interacting to give rise to disease phenotypes. Such factors affect biological systems through many layers of regulation, including transcriptional and epigenetic regulation and protein changes. To fully understand their molecular mechanisms, complex diseases are often studied along diverse dimensions, including genetics (genotype variations measured by single nucleotide polymorphism (SNP) arrays or whole exome sequencing), transcriptomics, epigenetics, and proteomics.
We developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors based on multiple types of molecular data before further integrative analysis.

Summary

Introduction

Cells employ multiple levels of regulation that enable them to respond to genetic, epigenetic, genomic, and environmental perturbations. Sample errors, including sample swapping, mis-labeling, and improper data entry, are inevitable during large-scale data generation. Some of these errors can be detected during quality control (QC) on each type of data; others are more elusive and may affect integrative data analysis, depending on the integration methods used. Assume that samples A and B are swapped in gene expression data. If both samples belong to the same subgroup (e.g., normal control or disease), the derived signatures will not be affected by the sample mis-labeling error. In other integrative analyses, such as genetics of gene expression studies [2,3], in which the aim is to discover how DNA variations or single nucleotide polymorphisms (SNPs) regulate gene expression changes, sample errors could have a larger effect. Mis-matching of 20% of samples between genotype and gene expression data decreased the number of cis-eSNPs by 70% [4].
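The reasoning above can be illustrated with a toy simulation. The sketch below is not the MODMatcher implementation; it is a minimal example under stated assumptions: synthetic genotypes coded 0/1/2, one expression trait per SNP with an additive cis effect plus Gaussian noise, and a fixed correlation threshold standing in for a significance test. All function names, sample sizes, and the threshold are hypothetical choices made for illustration, and the simulation does not attempt to reproduce the 70% figure cited above. It shows two things: that mislabeling a fraction of expression profiles lowers the number of detectable cis associations, and that cis relationships carry enough signal to match each expression profile back to its genotype profile.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def simulate(n_samples=80, n_pairs=200, effect=1.0, seed=0):
    """Synthetic data: row = sample, column j = one cis SNP-gene pair (assumed model)."""
    rng = random.Random(seed)
    geno = [[rng.choice((0, 1, 2)) for _ in range(n_pairs)] for _ in range(n_samples)]
    expr = [[effect * g + rng.gauss(0, 1) for g in row] for row in geno]
    return geno, expr


def mislabel(expr, fraction, seed=1):
    """Cyclically shift the labels of a random subset of expression profiles.

    Returns the relabeled profiles and origin, where origin[i] is the true
    sample whose profile now carries label i.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(len(expr)), int(fraction * len(expr)))
    origin = list(range(len(expr)))
    for i, j in zip(chosen, chosen[1:] + chosen[:1]):
        origin[i] = j
    return [expr[j] for j in origin], origin


def count_cis_hits(geno, expr, threshold=0.5):
    """Count cis pairs whose |correlation| reaches an arbitrary threshold."""
    return sum(
        abs(pearson([g[j] for g in geno], [e[j] for e in expr])) >= threshold
        for j in range(len(geno[0]))
    )


def zscore_columns(rows):
    """Standardize each column (cis pair) across samples."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def best_genotype_match(geno, expr):
    """For each expression profile, pick the genotype profile with the highest
    summed product of z-scores over all cis pairs."""
    zg, ze = zscore_columns(geno), zscore_columns(expr)
    matches = []
    for e in ze:
        scores = [sum(a * b for a, b in zip(e, g)) for g in zg]
        matches.append(max(range(len(zg)), key=scores.__getitem__))
    return matches


geno, expr = simulate()
shuffled, origin = mislabel(expr, fraction=0.2)
hits_clean = count_cis_hits(geno, expr)
hits_shuffled = count_cis_hits(geno, shuffled)
recovered = best_genotype_match(geno, shuffled)
print("cis hits, correct labels:", hits_clean)
print("cis hits, 20% mislabeled:", hits_shuffled)
print("all profiles matched to their true genotype:", recovered == origin)
```

The similarity score here is a plain sum of z-score products, chosen only because it needs no external libraries. A real analysis would use measured cis-eSNP pairs, a proper significance test, and safeguards against ambiguous or many-to-one matches.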

Journal: PLoS Computational Biology
Publication Date: Aug 14, 2014
Citations: 36
License type: CC BY 4.0
