Abstract
Motivation: Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score.

Results: Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these ‘gold standard’ Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-maximal exact match and vg to align more reads correctly.

Availability and implementation: Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.

Supplementary information: Supplementary data are available at Bioinformatics online.
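As a rough illustration of what a heuristic-free alignment involves, the sketch below fills every cell of a dynamic-programming matrix for local alignment with affine gap penalties against a linear reference, so the returned score is optimal under the chosen scoring function. It is not Vargas's implementation, which the abstract describes as vectorized, multithreaded and graph-aware. The function name `local_affine_score` and the default scoring values are assumptions chosen for illustration only.

```python
NEG_INF = float("-inf")


def local_affine_score(read, ref, match=2, mismatch=6, gap_open=5, gap_extend=3):
    """Return the optimal local alignment score of `read` against `ref`.

    Every cell of the DP matrix is computed (no seeding or pruning), so the
    result is exact for this scoring function. A gap of length k costs
    gap_open + k * gap_extend. All parameter values are illustrative.
    """
    n = len(ref)
    h_prev = [0] * (n + 1)        # best score of an alignment ending at (i-1, j)
    e_prev = [NEG_INF] * (n + 1)  # best score ending at (i-1, j) in a vertical gap
    best = 0

    for i in range(1, len(read) + 1):
        h_cur = [0] * (n + 1)
        e_cur = [NEG_INF] * (n + 1)
        f = NEG_INF  # best score ending at (i, j-1) in a horizontal gap
        for j in range(1, n + 1):
            # Vertical gap: a read base is aligned against nothing.
            e_cur[j] = max(e_prev[j] - gap_extend,
                           h_prev[j] - gap_open - gap_extend)
            # Horizontal gap: a reference base is aligned against nothing.
            f = max(f - gap_extend,
                    h_cur[j - 1] - gap_open - gap_extend)
            # Diagonal move: match or mismatch.
            sub = match if read[i - 1] == ref[j - 1] else -mismatch
            diag = h_prev[j - 1] + sub
            # Local alignment: a score may restart from zero at any cell.
            h_cur[j] = max(0, diag, e_cur[j], f)
            best = max(best, h_cur[j])
        h_prev, e_prev = h_cur, e_cur

    return best
```

The loop visits one cell per (read base, reference base) pair, which is the unit counted by the "cell updates per second" figure; the cost of doing this for every read is why the abstract emphasizes SIMD and multi-core parallelization.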
Highlights
Biological gold standards such as the Platinum Genomes (Eberle et al., 2017), synthetic diploid (Li et al., 2018), and Genome in a Bottle (Zook et al., 2014) catalog the variants present in a genome and are used to benchmark variant calling algorithms on real sequencing data
We presented Vargas, a heuristic-free read alignment tool achieving extremely high multithreaded throughput
Read alignments produced by Vargas can be used as a computational gold standard for evaluating short-read alignment algorithms, including with real sequencing datasets, and in much the same way as biological gold standards are used to assess variant calling algorithms
Summary
Biological gold standards such as the Platinum Genomes (Eberle et al., 2017), synthetic diploid (Li et al., 2018), and Genome in a Bottle (Zook et al., 2014) catalog the variants present in a genome and are used to benchmark variant calling algorithms on real sequencing data. For benchmarking and algorithm development, using gold standard call sets is more realistic than simulating sequencing reads from a synthetic genome with known variants. Read alignment algorithms, which determine a sequencing read’s point of origin with respect to a reference genome, are instead often evaluated using simulated sequencing reads, because there is no biological gold standard that directly answers where sequencing reads should align.