Abstract

Background: Inadvertent sample swaps are a real threat to data quality in any medium to large scale omics studies. While matches between samples from the same individual can in principle be identified from a few well characterized single nucleotide polymorphisms (SNPs), omics data types often only provide low to moderate coverage, thus requiring integration of evidence from a large number of SNPs to determine if two samples derive from the same individual or not.

Methods: We select about six thousand SNPs in the human genome and develop a Bayesian framework that is able to robustly identify sample matches between next generation sequencing data sets.

Results: We validate our approach on a variety of data sets. Most importantly, we show that our approach can establish identity between different omics data types such as Exome, RNA-Seq, and MethylCap-Seq. We demonstrate how identity detection degrades with sample quality and read coverage, but show that twenty million reads of a fairly low quality RNA-Seq sample are still sufficient for reliable sample identification.

Conclusion: Our tool, SMASH, is able to identify sample mismatches in next generation sequencing data sets between different sequencing modalities and for low quality sequencing data.

Highlights

  • Inadvertent sample swaps are a real threat to data quality in any medium to large scale omics studies

  • In order to computationally identify samples that are derived from the same individual, we select a set of common single nucleotide polymorphisms (SNPs) that we use as the genetic fingerprint of the individual

  • In order to maximize our ability to apply our method to data sets from different library preparation methods, we select SNPs in genomic locations that are covered by exome sequencing and various ribonucleic acid (RNA) sequencing approaches

Introduction

Inadvertent sample swaps are a real threat to data quality in any medium to large scale omics studies. While matches between samples from the same individual can in principle be identified from a few well characterized single nucleotide polymorphisms (SNPs), omics data types often only provide low to moderate coverage, requiring integration of evidence from a large number of SNPs to determine if two samples derive from the same individual or not. Because no laboratory tracking method is perfect, there is always a risk of error in sample identification in next generation sequencing (NGS), and this risk grows with the size and scope of a study [1]. Sequencing core laboratories and genomic centers use different instruments and protocols from center to center [2]. In tumor-normal comparisons, knockdown/knockout analyses in primary cultures, or drug trial studies, an incorrectly identified sample can have egregious effects on the resulting data.
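To make concrete what integrating evidence across many low-coverage SNPs can look like, here is a minimal sketch of a Bayesian comparison between two samples. It is not SMASH's actual model, which this summary does not describe. The function names, the fixed sequencing error rate, the Hardy-Weinberg genotype prior, and the input layout (per-SNP reference and alternate read counts) are all assumptions made for illustration.

```python
import math

def genotype_likelihoods(ref_reads, alt_reads, err=0.01):
    """P(reads | genotype) for 0, 1, or 2 copies of the alternate allele.

    Assumption: reads are independent draws; the binomial coefficient is
    omitted because it cancels in the ratio computed below.
    """
    likelihoods = []
    for p_alt in (err, 0.5, 1.0 - err):
        likelihoods.append((p_alt ** alt_reads) * ((1.0 - p_alt) ** ref_reads))
    return likelihoods

def hwe_prior(alt_freq):
    """Genotype prior under Hardy-Weinberg equilibrium (an assumption)."""
    return [(1.0 - alt_freq) ** 2,
            2.0 * alt_freq * (1.0 - alt_freq),
            alt_freq ** 2]

def log_bayes_factor(sample_a, sample_b, alt_freqs, err=0.01):
    """Sum over SNPs of log P(data | same person) - log P(data | different people).

    sample_a, sample_b: lists of (ref_reads, alt_reads), one tuple per SNP.
    alt_freqs: population alternate-allele frequency for each SNP.
    Intended for low coverage; very deep coverage would need log-space
    arithmetic to avoid floating point underflow.
    """
    total = 0.0
    for (ref_a, alt_a), (ref_b, alt_b), freq in zip(sample_a, sample_b, alt_freqs):
        prior = hwe_prior(freq)
        lik_a = genotype_likelihoods(ref_a, alt_a, err)
        lik_b = genotype_likelihoods(ref_b, alt_b, err)
        # Same individual: both samples share one unknown genotype.
        same = sum(p * x * y for p, x, y in zip(prior, lik_a, lik_b))
        # Different individuals: genotypes are drawn independently.
        diff = (sum(p * x for p, x in zip(prior, lik_a))
                * sum(p * y for p, y in zip(prior, lik_b)))
        total += math.log(same) - math.log(diff)
    return total
```

In this sketch a positive total favors the hypothesis that both samples come from the same individual and a negative total favors different individuals. A SNP with no reads in either sample contributes nothing, and a SNP with only a handful of reads contributes only a small amount, which is why many SNPs are needed before the sum becomes decisive.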

