Abstract
Background
It is extremely common to need to select a subset of reads from a BAM file based on their specific properties. Typically, a user unpacks the BAM file to a text stream using SAMtools, parses and filters the lines using AWK, then repacks them using SAMtools. This process is tedious and error-prone. In particular, when working with many columns of data, mix-ups are common and the bit field containing the flags is unintuitive. There are several libraries for reading BAM files, such as Bio-SamTools for Perl and pysam for Python. Both allow access to the BAM’s read information and can filter reads, but require substantial boilerplate code; this is high overhead for mostly ad hoc filtering.
Results
We have created a query language that gathers reads using a collection of predicates and common logical connectives. Queries run faster than equivalents and can be compiled to native code for embedding in larger programs.
Conclusions
BAMQL provides a user-friendly, powerful and performant way to extract subsets of BAM files for ad hoc analyses or integration into applications. The query language provides a collection of predicates beyond those in SAMtools, and more flexible connectives.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1162-y) contains supplementary material, which is available to authorized users.
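To make the Background's point about the flag bit field concrete, here is a minimal Python sketch of the kind of hand-written filtering the text-stream workflow requires. It is not BAMQL and not the authors' code: the bit values follow the SAM format specification, while the names `SAM_FLAGS`, `decode_flag` and `keep_read`, and the MAPQ threshold of 30, are hypothetical choices made only for illustration.

```python
# SAM FLAG bits as defined in the SAM format specification.
# The short names on the right are illustrative labels, not official identifiers.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of every bit set in an integer SAM FLAG."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def keep_read(sam_line, min_mapq=30):
    """Hypothetical filter: keep mapped, non-duplicate reads with MAPQ >= min_mapq.

    Assumes one tab-separated SAM alignment line, where column 2 is FLAG
    and column 5 is MAPQ (0-based indices 1 and 4).
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    return not (flag & 0x4) and not (flag & 0x400) and mapq >= min_mapq


# A FLAG of 99 is 0x1 + 0x2 + 0x20 + 0x40.
print(decode_flag(99))
```

Even this small filter depends on remembering column positions and hexadecimal bit masks, which is the kind of mix-up the Background describes.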
Highlights
It is extremely common to need to select a subset of reads from a Binary Alignment/Map (BAM) file based on their specific properties
To simplify the subsetting process, but retain the ability to have powerful queries, we developed BAM Query Language (BAMQL), a domain-specific language for matching BAM reads
To test the efficacy, we compared several equivalent queries written in BAMQL, SAMtools (v1.3) + GNU AWK (v4.0.1), Sambamba (v0.5.9), Python using pysam (v0.8.2) [3], Perl using Bio-SamTools (v1.41) [4], and C using HTSlib (v1.1)
Summary
It is extremely common to need to select a subset of reads from a BAM file based on their specific properties. Typically, a user unpacks the BAM file to a text stream using SAMtools, parses and filters the lines using AWK, and then repacks them using SAMtools. There are several libraries for reading BAM files, such as Bio-SamTools for Perl and pysam for Python. Both allow access to the BAM’s read information and can filter reads, but require substantial boilerplate code; this is high overhead for mostly ad hoc filtering. The selection condition is restricted to a filter that describes which