Pairwise alignment of nucleotide sequences using maximal exact matches

Arash Bayat, Aleksandar Ignjatović, Bruno Gaëta, Sri Parameswaran

doi:10.1186/s12859-019-2827-0

Open Access

https://doi.org/10.1186/s12859-019-2827-0

Journal: BMC Bioinformatics
Publication Date: May 21, 2019
Citations: 5
License type: open-access

Affiliation: UNSW Sydney, CSIRO Health and Biosecurity

Abstract

Background: Pairwise alignment of short DNA sequences with affine-gap scoring is a common processing step performed in a range of bioinformatics analyses. Dynamic programming (i.e. the Smith-Waterman algorithm) is widely used for this purpose. Despite the use of data-level parallelisation, pairwise alignment consumes much time. There are faster alignment algorithms, but they suffer from a lack of accuracy.

Results: In this paper, we present MEM-Align, a fast semi-global alignment algorithm for short DNA sequences that allows for affine-gap scoring and exploits sequence similarity. In contrast to traditional alignment methods (such as Smith-Waterman), where individual symbols are aligned, MEM-Align extracts Maximal Exact Matches (MEMs) using a bit-level parallel method and then looks for a subset of MEMs that forms the alignment using a novel dynamic programming method. MEM-Align tries to mimic the alignment produced by Smith-Waterman. As a result, for 99.9% of input sequence pairs, the computed alignment score is identical to the alignment score computed by Smith-Waterman. Yet MEM-Align is up to 14.5 times faster than the Smith-Waterman algorithm. The fast run-time is achieved by: (a) using a bit-level parallel method to extract MEMs; (b) processing MEMs rather than individual symbols; and (c) applying heuristics.

Conclusions: MEM-Align is a potential candidate to replace other pairwise alignment algorithms used in processes such as DNA read mapping and variant calling.
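
To make the notion of a MEM concrete, the following is a minimal sketch that lists the maximal exact matches between two short sequences by scanning each diagonal of the comparison matrix. It is a naive character-by-character scan, not the bit-level parallel extraction used by MEM-Align, and it does not include the dynamic programming step that selects a subset of MEMs. The function name `maximal_exact_matches`, the `min_len` parameter, and the `(start_a, start_b, length)` tuple layout are assumptions made for illustration.

```python
def maximal_exact_matches(a, b, min_len=1):
    """Return (start_a, start_b, length) for every maximal run of equal
    symbols along each diagonal of the a-versus-b comparison matrix.

    Naive illustrative scan; the paper's bit-level parallel method is
    not reproduced here.
    """
    mems = []
    # Diagonal d pairs a[i] with b[i - d]; d ranges over all overlaps.
    for d in range(-(len(b) - 1), len(a)):
        i, j = (d, 0) if d >= 0 else (0, -d)
        run = 0  # length of the current run of matching symbols
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                # A mismatch ends the run, so the run is maximal.
                if run >= min_len:
                    mems.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run that reaches the end of a sequence is also maximal.
        if run >= min_len:
            mems.append((i - run, j - run, run))
    return mems


if __name__ == "__main__":
    # Two sequences that differ by a single substitution at index 3.
    print(maximal_exact_matches("ACGTTGCA", "ACGATGCA", min_len=3))
```

In this example the single substitution splits the main diagonal into two MEMs, one on each side of the mismatch, which is the kind of structure an alignment built from MEMs can exploit when the sequences are similar.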

Highlights

Pairwise alignment of short DNA sequences with affine-gap scoring is a common processing step performed in a range of bioinformatics analyses
Synthetic datasets were prepared by random selection of short sequences from the reference human genome followed by simulated variations
Two configurations of MEM-Align (MA1 and MA2), as described in Table 4, were compared with four other alignment algorithms: a single-instruction-multiple-data (SIMD) implementation of Smith-Waterman (SSW) [22]; an implementation of Ukkonen's algorithm (UKK) taken from SNAP [12]; Gene Myers' algorithm (GM); and a combination of Gene Myers' algorithm with Hirschberg's algorithm (GMH) implemented in the SeqAn package [23]

Summary

Introduction

Pairwise alignment of short DNA sequences with affine-gap scoring is a common processing step performed in a range of bioinformatics analyses. Dynamic programming (i.e. the Smith-Waterman algorithm) is widely used for this purpose. The term alignment covers a broad range of different processes. The seed-and-extend method is a popular technique for aligning reads to the reference genome. This technique is used in DNA read-mappers such as BWA [2, 3] and Bowtie [4, 5]. In the seed-and-extend technique, small subsequences of a read (called seeds) are searched for in the reference genome to find candidate regions. Once a rough alignment is identified (the seeding step), the read is typically aligned to all candidate regions using a dynamic programming algorithm (the extending step).
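
The seed-and-extend idea can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not the procedure used by BWA or Bowtie: seeds are looked up in a plain k-mer dictionary, and the extending step uses a semi-global dynamic programming score with a linear gap penalty rather than affine-gap scoring. All function names, the scoring values, and the `k` and `pad` parameters are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Seeding step: each seed hit implies a start position for the read."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            if pos - offset >= 0:
                starts.add(pos - offset)
    return sorted(starts)


def semi_global_score(read, window, match=1, mismatch=-1, gap=-2):
    """Extending step: align the whole read inside the window.

    Leading and trailing window symbols are free; gaps are scored
    linearly (a simplification of affine-gap scoring).
    """
    prev = [0] * (len(window) + 1)  # free start anywhere in the window
    for i in range(1, len(read) + 1):
        cur = [prev[0] + gap] + [0] * len(window)
        for j in range(1, len(window) + 1):
            sub = match if read[i - 1] == window[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return max(prev)  # free end anywhere in the window


def seed_and_extend(read, reference, k=4, pad=2):
    """Return (best_score, candidate_start), or None if no seed hits."""
    index = build_kmer_index(reference, k)
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = semi_global_score(read, reference[lo:hi])
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCAGTACGGATC"
    print(seed_and_extend("TTGACCAGT", reference))  # exact substring
    print(seed_and_extend("TTGACGAGT", reference))  # one substitution
```

The dynamic programming call is made once per candidate region, so its cost is paid repeatedly for every read, which is why the speed of the pairwise alignment step matters in read mapping.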

Methods

Results

Discussion

Conclusion