Abstract

A challenge of next generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata, along with RNA-seq datasets from other studies, to understand factors that contribute to contamination. Here we report that, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting bulk and single cell (scRNA-seq) data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses.

Highlights

  • A challenge of next generation sequencing is read contamination

  • Limitations exist for all –omics technologies, including RNA sequencing (RNA-Seq)

  • To identify other highly expressed tissue-enriched genes appearing variably in other samples, we cross-referenced a list of tissue-enriched proteins generated by the Human Protein Atlas (HPA) to the Genotype-Tissue Expression (GTEx) transcripts per million (TPM) data (Table 1)[19,20]
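
The cross-referencing step in the last highlight can be illustrated with a minimal sketch. It assumes a hypothetical mapping from tissue-enriched genes to their native tissue and a toy list of TPM records. The function name, the record layout, the sample values, and the TPM cutoff are all illustrative assumptions and do not come from the study.

```python
from collections import defaultdict

# Hypothetical map of tissue-enriched genes to their native tissue.
# Gene names are the pancreas-enriched genes named in the abstract;
# the data structure itself is an assumption made for illustration.
NATIVE_TISSUE = {
    "PRSS1": "pancreas",
    "PNLIP": "pancreas",
    "CLPS": "pancreas",
    "CELA3A": "pancreas",
}


def flag_candidate_contamination(records, native_tissue, tpm_threshold=10.0):
    """Return {sample_id: [genes]} for non-native samples above the cutoff.

    records: iterable of (sample_id, tissue, gene, tpm) tuples.
    tpm_threshold: arbitrary illustrative cutoff, not a value from the study.
    """
    flagged = defaultdict(list)
    for sample_id, tissue, gene, tpm in records:
        home = native_tissue.get(gene)
        if home is None or tissue == home:
            continue  # skip untracked genes and expression in the native tissue
        if tpm > tpm_threshold:
            flagged[sample_id].append(gene)
    return {sample: sorted(genes) for sample, genes in flagged.items()}


# Toy records invented for illustration only.
toy_records = [
    ("S1", "pancreas", "PRSS1", 50000.0),
    ("S2", "lung", "PRSS1", 35.0),
    ("S2", "lung", "PNLIP", 12.5),
    ("S3", "lung", "PRSS1", 0.4),
]

result = flag_candidate_contamination(toy_records, NATIVE_TISSUE)
print(result)
```

A fixed cutoff keeps the sketch short. A real analysis would likely need to account for per-tissue baseline expression and for sequencing metadata such as run date, which the abstract identifies as strongly associated with contamination.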


Summary

Introduction

A challenge of next generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata, along with RNA-seq datasets from other studies, to understand factors that contribute to contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among 18 highly expressed, tissue-enriched genes. This type of contamination occurs widely, impacting bulk and single cell (scRNA-seq) data set analysis. As cost per basepair decreases, more large-scale transcriptome projects can be performed that will inform on tissue expression patterns in health and disease[1,2,3,4]. These data sources are generally publicly available and have been used by hundreds of researchers for secondary analyses of high impact[5,6]. Library preparation biases and computational biases such as positional fragment bias are known limitations of RNA-Seq experiments[7,8,9]. Another challenge of high-throughput RNA-Seq is contamination, in which sequence data originating from one sample appears within the data set of a separate sample. We further demonstrate the universality of highly expressed genes contaminating other samples.

