Abstract

Background: Gene expression data can be compromised by cells originating from tissues other than the target tissue of profiling. Failure to detect such tissue heterogeneity has profound implications for data interpretation and reproducibility. A computational tool explicitly addressing the issue is warranted.

Results: We introduce BioQC, an R/Bioconductor software package to detect tissue heterogeneity in gene expression data. To this end, BioQC implements a computationally efficient Wilcoxon-Mann-Whitney test and provides more than 150 signatures of tissue-enriched genes derived from large-scale transcriptomics studies. Simulation experiments show that BioQC is both fast and sensitive in detecting tissue heterogeneity. In a case study with whole-organ profiling data, BioQC predicted contamination events that were confirmed by quantitative RT-PCR. Applied to transcriptomics data of the Genotype-Tissue Expression (GTEx) project, BioQC reveals clustering of samples and suggests that some samples likely suffer from tissue heterogeneity.

Conclusions: Our experience with gene expression data indicates a prevalence of tissue heterogeneity that often goes unnoticed. BioQC addresses the issue by integrating prior knowledge with a scalable algorithm. We propose BioQC as a first-line tool to ensure quality and reproducibility of gene expression data.

Highlights

  • Gene expression data can be compromised by cells originating from tissues other than the target tissue of profiling

  • As we examined genes in the signature, we observed substantial expression of many genes including insulin (INS), glucagon (GCG), and pancreatic carboxypeptidase A1 (CPA1) in the three samples (Fig. 2b)

  • We quantified expression of amylase (AMY1A) and elastase (CELA1), both highly expressed in pancreas and absent in kidney according to Genotype-Tissue Expression (GTEx) [7] and Human Protein Atlas [20], with quantitative reverse transcription polymerase chain reaction (RT-PCR)

Summary

Results

We apply BioQC to simulated and real-world datasets to demonstrate its use. Unless otherwise specified, all computations are performed on a single thread of a 4-core laptop with 8 GB of memory running R-3.2.0 on 64-bit Linux Mint (version 16).

The asymmetry is likely caused by the relatively high expression of heart-enriched genes compared with small-intestine-specific genes. Following this example, we mixed all pairs of canine tissues and found that on average BioQC is able to detect heterogeneity with 20% or more contamination (enrichment score 3.0 or rank 10, Figure 3 in Additional file 3: Document 2). Simulation studies with model-generated and real-world data demonstrate that BioQC is scalable and sensitive in detecting tissue heterogeneity.

We have integrated BioQC into our gene expression analysis pipeline for three years to routinely detect tissue heterogeneity in internal and external studies. It has raised warning flags in many datasets, independent of the target tissue of profiling, organism, experiment design, profiling platform and laboratory.
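The mixing experiment described above can be illustrated with a minimal sketch. This is not BioQC's implementation (the package itself is written for R/Bioconductor). It is a plain-Python approximation that makes several assumptions: the Wilcoxon-Mann-Whitney test uses the normal approximation without tie correction, the enrichment score is taken to be -log10 of the one-sided p-value, and the expression profiles, the signatures `heart_sig` and `intestine_sig`, the shift of 3.0 and the linear mixing model are all hypothetical toy choices.

```python
import math
import random


def rank_with_ties(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wmw_enrichment_score(expression, signature):
    """One-sided WMW test that signature genes rank higher than the rest.

    Returns -log10(p) (assumed definition of the enrichment score), using
    the normal approximation without tie correction. Requires at least one
    gene inside and one gene outside the signature.
    """
    n = len(expression)
    n1 = len(signature)
    n2 = n - n1
    ranks = rank_with_ties(expression)
    u = sum(ranks[i] for i in signature) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability
    return -math.log10(max(p, 1e-300))       # guard against log(0)


def simulate_profile(n_genes, signature, shift, rng):
    """Toy profile: Gaussian background, signature genes shifted upward."""
    profile = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    for i in signature:
        profile[i] += shift
    return profile


def mix(target, contaminant, fraction):
    """Linear mixture with the given contaminant fraction (an assumption)."""
    return [(1.0 - fraction) * t + fraction * c
            for t, c in zip(target, contaminant)]


rng = random.Random(0)
n_genes = 2000
heart_sig = list(range(0, 50))        # hypothetical signature indices
intestine_sig = list(range(50, 100))  # hypothetical signature indices
heart = simulate_profile(n_genes, heart_sig, 3.0, rng)
intestine = simulate_profile(n_genes, intestine_sig, 3.0, rng)

# Score the contaminant signature at increasing contamination fractions.
scores = {}
for fraction in (0.0, 0.1, 0.2, 0.5):
    mixed = mix(heart, intestine, fraction)
    scores[fraction] = wmw_enrichment_score(mixed, intestine_sig)
    print(f"contamination {fraction:.0%}: score {scores[fraction]:.2f}")
```

In this toy setting the score of the contaminating signature grows with the contamination fraction. Comparing it against a threshold such as the enrichment score of 3.0 mentioned above gives a simple detection rule. The detection limit reported for real canine tissues depends on real expression distributions and is not reproduced by this sketch.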

