Abstract

Background: Arguably the most basic step in the analysis of next generation sequencing (NGS) data involves the extraction of mappable reads from the raw reads produced by sequencing instruments. The presence of barcodes, adaptors and artifacts subject to sequencing errors makes this step non-trivial.

Results: Here I present TagDust2, a generic approach utilizing a library of hidden Markov models (HMM) to accurately extract reads from a wide array of possible read architectures. TagDust2 extracts more reads of higher quality compared to other approaches. Multiplexed single-end and paired-end libraries, as well as libraries containing unique molecular identifiers, are fully supported. Two additional post-processing steps are included to exclude known contaminants and filter out low-complexity sequences. Finally, TagDust2 can automatically detect the library type of sequenced data from a predefined selection.

Conclusion: Taken together, TagDust2 is a feature-rich, flexible and adaptive solution to go from raw to mappable NGS reads in a single step. The ability to recognize and record the contents of raw reads will help to automate and demystify the initial, and often poorly documented, steps in NGS data analysis pipelines. TagDust2 is freely available at: http://tagdust.sourceforge.net.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0454-y) contains supplementary material, which is available to authorized users.

Highlights

  • The most basic step in the analysis of next generation sequencing (NGS) data involves the extraction of mappable reads from the raw reads produced by sequencing instruments

  • In the best case these errors lead to some sequences being lost to the downstream analysis, but in the worst case sequences can be mixed up between samples, leading to analytical noise

  • Given that the original read and sample are known, the number of reads assigned to the wrong sample and the total number of extracted reads can be quantified

Correspondence: timolassmann@gmail.com. Affiliations: (1) RIKEN Center for Life Science Technologies (CLST), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Kanagawa, Japan; (2) Telethon Kids Institute, The University of Western Australia, 100 Roberts Road, Subiaco, Western Australia 6008, Australia


Summary

Results

To assess the performance of TagDust, I generated large datasets by varying the number of barcodes used, their lengths and the per-base sequencer error rate. An additional 10 thousand random sequences were added to assess the number of false positives. With 4 nt barcodes, TagDust is more conservative at extracting reads than fastx. As the number of barcodes and the error rate increase, the precision of both programs decreases. The precision of TagDust is consistently higher than that of fastx and far less affected by the per-base error rate. Increasing the barcode length to six nucleotides makes it much easier to unambiguously assign reads to a particular sample (Figure 3). TagDust is consistently more precise than fastx.
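The benchmark design described above can be illustrated with a small simulation. The sketch below is not TagDust's HMM and not the fastx implementation: it uses a deliberately simple nearest-barcode rule (unique best match within one mismatch) as a stand-in demultiplexer, and treats precision as the fraction of extracted reads assigned to their true sample. All function names, parameters and default values (`simulate`, `assign`, `max_mismatches`, the read counts) are assumptions made for illustration, and the simulated reads consist of the barcode only.

```python
import random

BASES = "ACGT"


def mutate(seq, error_rate, rng):
    """Replace each base by a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign(tag, barcodes, max_mismatches=1):
    """Index of the unique closest barcode, or None if too distant or tied."""
    dists = [hamming(tag, bc) for bc in barcodes]
    best = min(dists)
    if best > max_mismatches or dists.count(best) > 1:
        return None
    return dists.index(best)


def random_seq(length, rng):
    return "".join(rng.choice(BASES) for _ in range(length))


def simulate(n_barcodes, barcode_len, error_rate,
             n_reads=2000, n_random=500, seed=0):
    """Simulate barcoded reads plus random decoys and score the assignments."""
    rng = random.Random(seed)
    # Draw distinct barcodes (requires n_barcodes <= 4 ** barcode_len).
    chosen = set()
    while len(chosen) < n_barcodes:
        chosen.add(random_seq(barcode_len, rng))
    barcodes = sorted(chosen)

    correct = wrong = 0
    for _ in range(n_reads):
        true_idx = rng.randrange(n_barcodes)
        observed = mutate(barcodes[true_idx], error_rate, rng)
        called = assign(observed, barcodes)
        if called is None:
            continue  # read not extracted
        if called == true_idx:
            correct += 1
        else:
            wrong += 1

    # Random decoy sequences: any assignment counts as a false positive.
    false_pos = sum(
        assign(random_seq(barcode_len, rng), barcodes) is not None
        for _ in range(n_random)
    )

    extracted = correct + wrong + false_pos
    precision = correct / extracted if extracted else 0.0
    return {"correct": correct, "wrong": wrong,
            "false_pos": false_pos, "precision": precision}


if __name__ == "__main__":
    for length in (4, 6):
        for rate in (0.01, 0.05):
            result = simulate(n_barcodes=24, barcode_len=length, error_rate=rate)
            print(length, rate, result)
```

Because the true sample of every simulated read is known, both the number of misassigned reads and the total number of extracted reads fall out directly, which is the property the benchmark relies on. The exact values depend on the seed and on the stand-in assignment rule, so they should not be read as a reproduction of the published figures.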

