Abstract

The ability to generate whole-genome data is rapidly becoming commoditized. For example, a mammalian-sized genome (~3 Gb) can now be sequenced using approximately ten lanes on an Illumina HiSeq 2000. Since lanes from different runs are often combined, verifying that each lane in a genome's build is from the same sample is an important quality control step. We sought to address this issue in a post hoc bioinformatic manner, instead of using upstream sample or “barcode” modifications. We rely on the inherent small differences between any two individuals to show that genotype concordance rates can be effectively used to test whether any two lanes of HiSeq 2000 data are from the same sample. As proof of principle, we use recent data from three different human samples generated on this platform. We show that the distributions of concordance rates are non-overlapping when comparing lanes from the same sample versus lanes from different samples. Our method proves to be robust even when different numbers of reads are analyzed. Finally, we provide a straightforward method for determining the gender of any given sample. Our results suggest that examining the concordance of detected genotypes from lanes purported to be from the same sample is a relatively simple approach for confirming that combined lanes of data are of the same identity and quality.

Highlights

  • Recent advances in sequencing throughput offer the ability to generate useful data from multiple individuals on a single run

  • We propose that a sufficient number of genotype calls can be made with data from a single HiSeq 2000 lane, resulting in enough overlapping positions between any two lanes of data to produce an accurate concordance rate

  • We suggest that the difference in concordance rates can be used to accurately determine if two lanes of data are from the same library and of comparable quality
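The concordance idea in the highlights above can be illustrated with a minimal sketch. It assumes each lane's genotype calls have already been loaded into a dictionary keyed by (chromosome, position); that representation, the `normalize` helper, and the `concordance_rate` function are hypothetical names chosen for illustration, not the authors' pipeline. The rate is computed only over positions genotyped in both lanes.

```python
from typing import Dict, Tuple

# Hypothetical representation: (chromosome, position) -> genotype call such as "A/G".
Genotypes = Dict[Tuple[str, int], str]


def normalize(genotype: str) -> str:
    """Sort alleles so that 'G/A' and 'A/G' compare as equal."""
    return "/".join(sorted(genotype.split("/")))


def concordance_rate(lane_a: Genotypes, lane_b: Genotypes) -> Tuple[float, int]:
    """Return (fraction of matching genotypes, number of shared positions)."""
    # Only positions called in both lanes can be compared.
    shared = lane_a.keys() & lane_b.keys()
    if not shared:
        raise ValueError("no overlapping genotyped positions")
    matches = sum(
        normalize(lane_a[pos]) == normalize(lane_b[pos]) for pos in shared
    )
    return matches / len(shared), len(shared)


# Toy data (made up for illustration only).
lane1 = {("chr1", 100): "A/G", ("chr1", 200): "C/C", ("chr2", 50): "T/T"}
lane2 = {
    ("chr1", 100): "G/A",
    ("chr1", 200): "C/T",
    ("chr2", 50): "T/T",
    ("chr3", 10): "A/A",
}
rate, n_shared = concordance_rate(lane1, lane2)
print(f"{n_shared} shared positions, concordance {rate:.3f}")
```

In practice, a rate like this would be computed for every pair of lanes, and pairs falling in the lower of the two non-overlapping distributions would be flagged as likely coming from different samples. Any cutoff between the two distributions would have to be set from real data; none is assumed here.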


Summary

Introduction

Recent advances in sequencing throughput offer the ability to generate useful data from multiple individuals on a single run. The impetus for this work came from our increased use of the newly developed HiSeq 2000, which allows us to generate high-coverage (30X or greater) whole-genome data with approximately 10 lanes. This means we frequently run more than one human sample on a flowcell. The Illumina workflow introduces a “flowcell flip” between the cBot cluster station and the HiSeq 2000 sequencing instrument, which requires samples to be initially loaded onto the flowcell in reverse order. This confusing and potentially error-prone step, combined with the ability to run two flowcells at once, increases the importance of verifying the identity of combined lanes, especially when a flowcell contains more than one sample from the same species (and simple reference genome alignment statistics cannot verify flowcell orientation).

