TagDust2: a generic method to extract reads from sequencing data.

Timo Lassmann

doi:10.1186/s12859-015-0454-y

Abstract

Background: Arguably the most basic step in the analysis of next generation sequencing data (NGS) involves the extraction of mappable reads from the raw reads produced by sequencing instruments. The presence of barcodes, adaptors and artifacts subject to sequencing errors makes this step non-trivial.

Results: Here I present TagDust2, a generic approach utilizing a library of hidden Markov models (HMM) to accurately extract reads from a wide array of possible read architectures. TagDust2 extracts more reads of higher quality compared to other approaches. Processing of multiplexed single, paired end and libraries containing unique molecular identifiers is fully supported. Two additional post processing steps are included to exclude known contaminants and filter out low complexity sequences. Finally, TagDust2 can automatically detect the library type of sequenced data from a predefined selection.

Conclusion: Taken together, TagDust2 is a feature rich, flexible and adaptive solution to go from raw to mappable NGS reads in a single step. The ability to recognize and record the contents of raw reads will help to automate and demystify the initial, and often poorly documented, steps in NGS data analysis pipelines. TagDust2 is freely available at: http://tagdust.sourceforge.net.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0454-y) contains supplementary material, which is available to authorized users.

Highlights

The most basic step in the analysis of next generation sequencing data (NGS) involves the extraction of mappable reads from the raw reads produced by sequencing instruments.
In the best case these errors lead to some sequences being lost to the downstream analysis, but in the worst case sequences can be mixed up between samples, leading to analytical noise.
Correspondence: timolassmann@gmail.com. 1 RIKEN Center for Life Science Technologies (CLST), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Kanagawa, Japan. 2 Telethon Kids Institute, The University of Western Australia, 100 Roberts Road, Subiaco, Subiaco, Western Australia 6008, Australia.
Given that the original read and sample are known, the number of reads assigned to the wrong sample and the total number of extracted reads can be quantified.

Summary

Results

To assess the performance of TagDust I generated large datasets, varying the number of barcodes, the barcode length and the per-base sequencing error rate. An additional 10 thousand random sequences were added to assess the number of false positives. With 4 nt barcodes, TagDust extracts reads more conservatively than fastx. As the number of barcodes and the error rate increase, the precision of both programs decreases. The precision of TagDust is nevertheless consistently higher than that of fastx and far less affected by the per-base error rate. Increasing the barcode length to six nucleotides makes it much easier to assign reads unambiguously to a particular sample (Figure 3). Here too, TagDust is consistently more precise than fastx.
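The benchmark design described above can be illustrated with a minimal sketch. This is not TagDust's HMM and not fastx's matching logic: the assigner below is a naive Hamming-distance matcher chosen only to make the example self-contained, and every name and default (`simulate`, `assign`, `max_mismatches`, the read counts) is a hypothetical choice for illustration. It shows how, once the true sample of each read is known, the number of extracted reads, the number of wrongly assigned reads and the precision can be counted.

```python
import random

BASES = "ACGT"

def random_seq(rng, length):
    """Draw a uniformly random nucleotide sequence."""
    return "".join(rng.choice(BASES) for _ in range(length))

def mutate(rng, seq, error_rate):
    """Substitute each base by a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign(read, barcodes, max_mismatches):
    """Return the index of the unique closest barcode, or None if the
    best match is too distant or tied (naive stand-in, an assumption)."""
    prefix = read[:len(barcodes[0])]
    dists = [hamming(prefix, bc) for bc in barcodes]
    best = min(dists)
    if best > max_mismatches or dists.count(best) > 1:
        return None
    return dists.index(best)

def simulate(n_barcodes, barcode_len, error_rate, n_reads=2000,
             n_random=200, insert_len=30, max_mismatches=1, seed=0):
    """Simulate barcoded reads plus random decoys and score the assigner."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_barcodes:  # requires n_barcodes <= 4 ** barcode_len
        chosen.add(random_seq(rng, barcode_len))
    barcodes = sorted(chosen)

    reads = []  # (true sample index or None, observed sequence)
    for _ in range(n_reads):
        sample = rng.randrange(n_barcodes)
        clean = barcodes[sample] + random_seq(rng, insert_len)
        reads.append((sample, mutate(rng, clean, error_rate)))
    for _ in range(n_random):  # decoys: any extraction is a false positive
        reads.append((None, random_seq(rng, barcode_len + insert_len)))

    extracted = wrong = 0
    for truth, seq in reads:
        call = assign(seq, barcodes, max_mismatches)
        if call is None:
            continue
        extracted += 1
        if call != truth:
            wrong += 1
    precision = (extracted - wrong) / extracted if extracted else 0.0
    return {"extracted": extracted, "wrong": wrong, "precision": precision}

if __name__ == "__main__":
    for length in (4, 6):
        for rate in (0.01, 0.03):
            print(length, rate, simulate(24, length, rate))
```

Running the script prints one result per combination of barcode length and error rate, which mirrors the kind of grid the comparison varies; the absolute numbers depend entirely on the toy matcher and should not be read as results for either program.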

Journal: BMC Bioinformatics	Publication Date: Jan 28, 2015
Citations: 75	License type: CC BY 4.0
