Abstract
Background: The Sequence Alignment/Map Format Specification (SAM) is one of the most widely adopted file formats in bioinformatics and many researchers use it daily. Several tools, including most high-throughput sequencing read aligners, use it as their primary output and many more tools have been developed to process it. However, despite its flexibility, SAM-encoded files can often be difficult to query and understand even for experienced bioinformaticians. As genomic data are rapidly growing, structured and efficient queries on data that are encoded in SAM/BAM files are becoming increasingly important. Existing tools are very limited in their query capabilities or are not efficient. Critically, new tools that address these shortcomings should not only be able to support existing large datasets but should also do so without requiring massive data transformations and file infrastructure reorganizations.
Results: Here we introduce SamQL, an SQL-like query language for the SAM format with intuitive syntax that supports complex and efficient queries on top of SAM/BAM files and that can replace commonly used Bash one-liners employed by many bioinformaticians. SamQL has high expressive power with no upper limit on query size and, when parallelized, outperforms other, substantially less expressive software.
Conclusions: SamQL is a complete query language that we envision as a step toward a structured database engine for genomics. SamQL is written in Go, and is freely available as a standalone program and as an open-source library under an MIT license, https://github.com/maragkakislab/samql/.
Highlights
The Sequence Alignment/Map Format Specification (SAM) is one of the most widely adopted file formats in bioinformatics and many researchers use it daily
SamQL is a complete query language with a lexer and parser designed for genomic data in the Sequence Alignment/Map (SAM)/Binary SAM (BAM) format
SamQL can replace most one-liners used by bioinformaticians, helping to reduce errors
Summary
Our primary aim in building SamQL was flexibility and high expressivity for complex queries, similar to classic SQL. To evaluate query performance and decouple it from Input/Output (IO), we measured the execution time both when printing to an output file and when just counting the filtered reads. As a test, we wished to filter on the NH:i tag, which involves numerical comparisons. This is an intuitive and straightforward query change in SamQL (Fig. 2B, top) and Sambamba. Our results again show that range queries for all tools are executed much faster than naive Bash (Additional file 1: Fig. S1A) and are comparable with each other. All our tests indicate that SamQL offers high expressivity for complex queries while achieving high performance and taking advantage of parallel computing.