Bit-parallel sequence-to-graph alignment.

Mikko Rautiainen, Tobias Marschall, Veli Mäkinen, Inanc Birol

doi:10.1093/bioinformatics/btz162

Open Access

Abstract

Motivation: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs that represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs that represent a pan-genome and hence the genetic variation present in a population. Aligning sequencing reads to such graphs is a key step for many analyses; applications include genome assembly, read error correction and variant calling with respect to a variation graph.

Results: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers' bitvector algorithm for semi-global alignment. Both linear algorithms process w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst-case runtime of O(|V| + ⌈m/w⌉|E| log w) for acyclic graphs and O(|V| + m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm.

Availability and implementation: https://github.com/maickrau/GraphAligner

Supplementary information: Supplementary data are available at Bioinformatics online.
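
To make the bit-parallel idea concrete, the following is a minimal sketch of the classic linear (sequence-to-sequence) Shift-And algorithm that the abstract names as a starting point. It is not the paper's graph generalization and is not taken from GraphAligner; the function name `shift_and` and its return convention are assumptions made for illustration. Python integers are unbounded, so a single integer stands in for the machine word here; on real hardware a pattern longer than w bits would need several words.

```python
def shift_and(pattern: str, text: str) -> list[int]:
    """Return the 0-based end positions of exact occurrences of pattern in text.

    Illustrative sketch of linear Shift-And (hypothetical helper, not GraphAligner code).
    """
    m = len(pattern)
    if m == 0:
        return []

    # masks[c] has bit i set exactly when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit m-1 set means the whole pattern has matched
    state = 0              # bit i set means pattern[:i+1] matches a suffix of the text read so far
    hits = []
    for j, ch in enumerate(text):
        # Extend every partial match by one character and start a new one at bit 0,
        # then keep only the extensions consistent with the current text character.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j)
    return hits
```

For example, `shift_and("ACA", "GACACA")` reports the two overlapping occurrences ending at positions 3 and 5. Each text character costs one shift, one OR and one AND regardless of how many partial matches are active, which is where the speedup of up to w comes from.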

Highlights

Aligning two sequences is a classic problem in bioinformatics
Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph
In addition to representing genomic diversity, graphs whose nodes or edges are labeled by characters are commonly used in many other applications in bioinformatics, for instance genome assembly (Compeau et al., 2011; Miller et al., 2010) and multiple sequence

Summary

Introduction

The standard dynamic programming (DP) algorithm, introduced by Needleman and Wunsch (1970), aligns two sequences of length n in O(n^2) time. Countless variants of this classic DP algorithm exist, in particular its generalization to local alignment (Smith and Waterman, 1981), where the alignment can be between any substrings of the two sequences, and semi-global alignment (Sellers, 1980), where one sequence (the query) is entirely aligned to a substring of the other (the reference). There is strong interest in pan-genomic methods for representing and analyzing the variations between individual genomes in a manner that avoids duplicate work in the shared genomic areas (Computational Pan-Genomics Consortium, 2018; Danek et al., 2014; Rahn et al., 2014). One such method is to use a graph as the reference, which provides a simple way of representing both shared and unique areas and can also represent complex variations (Garrison et al., 2018; Paten et al., 2017). In addition to representing genomic diversity, graphs whose nodes or edges are labeled by characters are commonly used in many other applications in bioinformatics, for instance genome assembly (Compeau et al., 2011; Miller et al., 2010) and multiple sequence
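
The semi-global variant described above can be sketched as a plain quadratic DP over two linear sequences. This is only an illustration of the recurrence under unit edit costs (an assumption; the text does not fix a scoring scheme), and the function name `semi_global_distance` is hypothetical. It is the non-bit-parallel baseline, not the bitvector or graph algorithm.

```python
def semi_global_distance(query: str, reference: str) -> int:
    """Minimum unit-cost edit distance between the whole query and any substring of reference.

    Illustrative sketch of the semi-global DP (hypothetical helper, unit costs assumed).
    """
    m = len(query)
    # column[i] = cheapest alignment of query[:i] against a reference substring
    # ending at the current reference position. Before any reference character
    # is read, query[:i] can only be aligned by i insertions.
    column = list(range(m + 1))
    best = column[m]

    for ref_char in reference:
        diagonal = column[0]  # value of row i-1 in the previous column
        # column[0] stays 0: the alignment may start at any reference position for free.
        for i in range(1, m + 1):
            above_prev = column[i]  # value of row i in the previous column
            mismatch = 0 if query[i - 1] == ref_char else 1
            column[i] = min(
                diagonal + mismatch,  # match or substitution
                above_prev + 1,       # reference character left unmatched
                column[i - 1] + 1,    # query character left unmatched
            )
            diagonal = above_prev
        # The alignment may also end at any reference position.
        best = min(best, column[m])
    return best
```

The only differences from global alignment are the free start (row 0 is always 0) and the free end (minimum over the last row). The running time is proportional to the product of the two sequence lengths, which is the cost the bit-parallel methods reduce by packing up to w rows of a column into one machine word.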

Methods

Results

Conclusion