Abstract
Background: Next generation sequencing (NGS) of amplified DNA is a powerful tool to describe genetic heterogeneity within cell populations, and can be used both to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, as it is difficult to distinguish rare true sequences from spurious sequences that are generated by PCR or sequencing errors. This issue applies, for instance, to cellular barcoding strategies that aim to follow the amount and type of offspring of single cells by supplying these cells with unique heritable DNA tags.
Results: Here, we use genetic barcoding data from the Illumina HiSeq platform to show that straightforward read threshold-based filtering of data is typically insufficient to filter out spurious barcodes. Importantly, we demonstrate that specific sequencing errors occur at an approximately constant rate across different samples that are sequenced in parallel. We exploit this observation by developing a novel approach to filter out spurious sequences.
Conclusions: Application of our new method demonstrates its value in the identification of true sequences amongst spurious sequences in biological data sets.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0999-4) contains supplementary material, which is available to authorized users.
Highlights
Next generation sequencing (NGS) of amplified DNA is a powerful tool to describe genetic heterogeneity within cell populations, and can be used both to investigate the clonal structure of cell populations and to perform genetic lineage tracing
Overview of experimental barcoding technology: in cellular barcoding (Fig. 1), progenitor cells of interest are isolated from appropriate tissue and exposed to a library of retro- or lentiviral vectors that each carry one DNA barcode from a large pool of barcodes
We present a novel approach to clean up barcoding data that does not require independent sequencing of a reference library, and that is based on our observation that individual sequencing errors occur at a predictable rate across samples in Illumina HiSeq data
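The observation that a given sequencing error recurs at a roughly constant rate across samples can be illustrated with a minimal sketch. This is not the authors' published method. It is a simplified illustration under stated assumptions: a candidate spurious barcode differs from its "parent" by exactly one base, its read count is a small and stable fraction of the parent's count in every sample where the parent is abundant, and the thresholds (`min_parent_reads`, `max_cv`, `max_ratio`) are hypothetical values chosen only for the example.

```python
from statistics import mean, pstdev


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ratio_stats(counts, child, parent, min_parent_reads):
    """Mean and coefficient of variation of child/parent read ratios.

    Only samples in which the parent has at least `min_parent_reads` reads
    are used. Returns None when fewer than two such samples exist or when
    the child is never observed alongside an abundant parent.
    """
    ratios = [
        sample.get(child, 0) / sample[parent]
        for sample in counts.values()
        if sample.get(parent, 0) >= min_parent_reads
    ]
    if len(ratios) < 2 or mean(ratios) == 0:
        return None
    return mean(ratios), pstdev(ratios) / mean(ratios)


def flag_spurious(counts, min_parent_reads=1000, max_cv=0.25, max_ratio=0.05):
    """Map each suspected spurious barcode to its putative parent barcode.

    All three thresholds are illustrative assumptions, not published values.
    """
    barcodes = sorted({b for sample in counts.values() for b in sample})
    flagged = {}
    for child in barcodes:
        for parent in barcodes:
            if child == parent or len(child) != len(parent):
                continue
            if hamming(child, parent) != 1:
                continue
            stats = ratio_stats(counts, child, parent, min_parent_reads)
            if stats is None:
                continue
            mean_ratio, cv = stats
            # Rare relative to the parent AND a near-constant ratio across samples.
            if mean_ratio <= max_ratio and cv <= max_cv:
                flagged[child] = parent
    return flagged


# Toy read counts per sample (invented for illustration only).
counts = {
    "s1": {"ACGT": 10000, "ACGA": 50, "TTTT": 3000},
    "s2": {"ACGT": 2000, "ACGA": 11, "TTTT": 40},
    "s3": {"ACGT": 5000, "ACGA": 24, "TTGT": 4000},
}
print(flag_spurious(counts))
```

In this toy input, `ACGA` tracks `ACGT` at about 0.5 % of its reads in all three samples and is therefore flagged, whereas `TTTT` and `TTGT` are each abundant in only one sample and are left alone. A plain read threshold would not separate these cases, since `ACGA` in `s1` has more reads than `TTTT` in `s2`.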
Summary
Next generation sequencing (NGS) of amplified DNA is a powerful tool to describe genetic heterogeneity within cell populations, and can be used both to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, as it is difficult to distinguish rare true sequences from spurious sequences that are generated by PCR or sequencing errors. This issue applies, for instance, to cellular barcoding strategies that aim to follow the amount and type of offspring of single cells by supplying these cells with unique heritable DNA tags. Quantification of the amount of offspring of a barcoded cell is achieved by PCR amplification, followed by next generation sequencing. Indexing of samples allows one to run many samples of different cell types, organs and time points within a single deep sequencing run, thereby allowing high-throughput acquisition of data [17].