Biobambam: tools for read pair collation based algorithms on BAM files

German Tischler, Steven Leonard

doi:10.1186/1751-0473-9-13

Abstract

Background: Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications such as duplicate marking or conversion to the FastQ format, which require access to the full information of each pair.

Results: In this paper we introduce biobambam, a set of tools based on the efficient collation of alignments in BAM files by read name. The employed collation algorithm avoids time- and space-consuming sorting of alignments by read name where this is possible without using more than a specified amount of main memory. Using this algorithm, tasks such as duplicate marking in BAM files and conversion of BAM files to the FastQ format can be performed very efficiently with limited resources. We also make the collation algorithm available in the form of an API for other projects. This API is part of the libmaus package.

Conclusions: In comparison with previous approaches to problems involving the collation of alignments by read name, such as BAM to FastQ conversion or duplicate marking utilities, our approach can often perform an equivalent task more efficiently in terms of the required main memory and run-time. Our BAM to FastQ conversion is faster than all widely known alternatives, including Picard and bamUtil. Our duplicate marking is about as fast as the closest competitor, bamUtil, for small data sets and faster than all known alternatives on large and complex data sets.
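The collation idea described in the abstract can be illustrated with a small sketch. This is not the biobambam or libmaus implementation, and the abstract does not give the algorithm's details. The sketch only assumes the stated goal: pair up mates by read name while keeping a bounded number of unmatched alignments in memory, and sort only what does not fit. The `Record` tuple, the `collate_pairs` function, the `max_pending` parameter and the in-memory `overflow` list (standing in for temporary files) are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical minimal alignment record: (read name, coordinate, sequence).
Record = Tuple[str, int, str]


def collate_pairs(
    records: Iterable[Record], max_pending: int
) -> Iterator[Tuple[Record, Record]]:
    """Yield mate pairs from a stream of records ordered by coordinate.

    Unmatched records wait in a hash table keyed by read name. Once the
    table holds max_pending records (assumed >= 1) and another unmatched
    record arrives, the table is emptied into an overflow list, which is
    an assumption standing in for spill files on disk. Only the overflow
    is sorted by read name at the end.
    """
    pending: Dict[str, Record] = {}
    overflow: List[Record] = []

    for rec in records:
        name = rec[0]
        mate = pending.pop(name, None)
        if mate is not None:
            # Mate was still in memory: emit the pair without any sorting.
            yield (mate, rec)
            continue
        if len(pending) >= max_pending:
            # Memory budget reached: spill everything currently waiting.
            overflow.extend(pending.values())
            pending.clear()
        pending[name] = rec

    # Whatever is still unmatched joins the overflow.
    overflow.extend(pending.values())

    # The sort is stable, so mates end up adjacent in their input order.
    overflow.sort(key=lambda r: r[0])
    i = 0
    while i + 1 < len(overflow):
        first, second = overflow[i], overflow[i + 1]
        if first[0] == second[0]:
            yield (first, second)
            i += 2
        else:
            i += 1  # orphan without a mate: skipped in this sketch


if __name__ == "__main__":
    reads = [
        ("r1", 10, "ACGT"), ("r2", 20, "CCGT"), ("r3", 30, "GGTA"),
        ("r1", 40, "TTAC"), ("r3", 50, "AACG"), ("r2", 60, "CGTA"),
    ]
    for left, right in collate_pairs(reads, max_pending=2):
        print(left[0], left[1], right[1])
```

With a large `max_pending`, every pair in this sketch is resolved through the hash table and the final sort has nothing to do. With a small one, more records pass through the overflow and are sorted by name.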

Highlights

Sequence alignment data is often ordered by coordinate when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data
The order of reads aligned to a reference which is most suitable for calling variants between the reads and the reference or within the reads is the one resulting from sorting the data by coordinate
Comparisons were performed with the current versions of the programs when our benchmarking for this paper started



Journal: Source Code for Biology and Medicine
Publication Date: Jun 20, 2014
Citations: 228
License type: CC BY 2.0
