Abstract

Motivation: International consortia such as the Genotype-Tissue Expression (GTEx) project, The Cancer Genome Atlas (TCGA) or the International Human Epigenome Consortium (IHEC) have produced a wealth of genomic datasets with the goal of advancing our understanding of cell differentiation and disease mechanisms. However, using all of these data effectively through integrative analysis is hampered by batch effects, large cell type heterogeneity and low replicate numbers. To study whether batch effects across datasets can be observed and adjusted for, we analyze RNA-seq data of 215 samples from ENCODE, Roadmap, BLUEPRINT and DEEP as well as 1336 samples from GTEx and TCGA. While batch effects are a considerable issue, it is non-trivial to determine whether batch adjustment improves data quality, especially when replicate numbers are low.

Results: We present a novel method for assessing the performance of batch effect adjustment methods on heterogeneous data. Our method borrows information from the Cell Ontology to establish whether batch adjustment leads to a better agreement between observed pairwise similarity and similarity of cell types inferred from the ontology. A comparison of state-of-the-art batch effect adjustment methods suggests that batch effects in heterogeneous datasets with low replicate numbers cannot be adequately adjusted. Better methods need to be developed, which can be assessed objectively in the framework presented here.

Availability and implementation: Our method is available online at https://github.com/SchulzLab/OntologyEval.

Supplementary information: Supplementary data are available at Bioinformatics online.
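The evaluation idea described in the abstract, checking whether batch adjustment brings observed pairwise similarity closer to ontology-derived similarity, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy ontology, the path-based similarity, the use of Pearson correlation between expression profiles, and the Spearman-style agreement measure. The ontology score actually used by the authors is defined in the paper and in the OntologyEval repository and may differ.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy ontology (child -> parent); NOT the real Cell Ontology.
PARENT = {
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "blood cell",
    "monocyte": "blood cell",
    "blood cell": "cell",
    "hepatocyte": "cell",
}

def ancestors(term):
    """Path from a term up to the root, starting with the term itself."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def ontology_similarity(a, b):
    """Assumed similarity: 1 / (1 + number of edges between a and b)."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = next(t for t in path_a if t in path_b)  # lowest common ancestor
    return 1.0 / (1 + path_a.index(common) + path_b.index(common))

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Average ranks (tied values share their mean rank)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in order[i:j + 1]:
            result[k] = (i + j) / 2 + 1
        i = j + 1
    return result

def agreement(expression, cell_types):
    """Rank correlation between expression-based and ontology-based
    similarity over all sample pairs (an assumed agreement measure)."""
    pairs = list(combinations(range(len(cell_types)), 2))
    observed = [pearson(expression[i], expression[j]) for i, j in pairs]
    expected = [ontology_similarity(cell_types[i], cell_types[j]) for i, j in pairs]
    return pearson(ranks(observed), ranks(expected))

# Made-up expression profiles, one row per sample (illustration only).
cell_types = ["T cell", "B cell", "hepatocyte"]
raw = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5]]       # T cell resembles hepatocyte
adjusted = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 3, 2, 1]]  # T cell resembles B cell

print("agreement before adjustment:", agreement(raw, cell_types))
print("agreement after adjustment: ", agreement(adjusted, cell_types))
```

In this toy setting, an adjustment counts as beneficial when the agreement value rises. Because the reference comes from the ontology rather than from replicates, the comparison remains possible even when each cell type is represented by a single sample.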

Highlights

  • A growing number of international consortia such as The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx) project and the International Human Epigenome Consortium (IHEC) have generated a wealth of epigenomic profiling data of cell lines, sorted primary cells and tissue samples

  • We briefly discuss quantitative alternatives and outline why they are not suitable for assessing batch effect adjustment (BEA) on heterogeneous datasets with low replicate numbers, before we present and evaluate an alternative approach based on leveraging information from an ontology

  • We developed the ontology score to assess if BEA is beneficial on heterogeneous datasets

Introduction

A growing number of international consortia such as The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx) project and the International Human Epigenome Consortium (IHEC) have generated a wealth of epigenomic profiling data of cell lines, sorted primary cells and tissue samples. These data will be of tremendous help in unraveling mechanisms of cell differentiation and in identifying patterns of epigenetic dysregulation in various diseases. A number of studies have shown that joint analysis of data from multiple projects enables novel applications of biological relevance (Cao et al., 2017; Zang et al., 2016). However, these integrative analyses are often hampered by batch effects, i.e. variation between datasets that is of technical origin and does not reflect biological variation.
