Introducing difference recurrence relations for faster semi-global alignment of long sequences

Hajime Suzuki, Masahiro Kasahara

doi:10.1186/s12859-018-2014-8

Open Access

Journal: BMC Bioinformatics, 19(Suppl 1)
Publication Date: Feb 1, 2018
Citations: 83
License type: open-access

Affiliation: The University of Tokyo

Abstract

Background: The read length of single-molecule DNA sequencers is reaching 1 Mb. Popular alignment software tools widely used for analyzing such long reads often take advantage of single-instruction multiple-data (SIMD) operations to accelerate calculation of dynamic programming (DP) matrices in the Smith–Waterman–Gotoh (SWG) algorithm with a fixed alignment start position at the origin. Nonetheless, 16-bit or 32-bit integers are necessary for storing the values in a DP matrix when sequences to be aligned are long; this situation hampers the use of the full SIMD width of modern processors.

Results: We proposed a faster semi-global alignment algorithm, “difference recurrence relations,” that runs more rapidly than the state-of-the-art algorithm by a factor of 2.1. Instead of calculating and storing all the values in a DP matrix directly, our algorithm computes and stores mainly the differences between the values of adjacent cells in the matrix. Although the SWG algorithm and our algorithm can output exactly the same result, our algorithm mainly involves 8-bit integer operations, enabling us to exploit the full width of SIMD operations (e.g., 32) on modern processors. We also developed a library, libgaba, so that developers can easily integrate our algorithm into alignment programs.

Conclusions: Our novel algorithm and optimized library implementation will facilitate accelerating nucleotide long-read analysis algorithms that use pairwise alignment stages. The library is implemented in the C programming language and available at https://github.com/ocxtal/libgaba.
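The reason differences fit in 8-bit integers while absolute scores do not can be illustrated with a small sketch. This is not the libgaba implementation: it assumes a linear gap penalty instead of the affine penalty of SWG, uses illustrative scores (match 2, mismatch -3, gap 5), and the function names `fill_matrix` and `difference_range` are hypothetical. It fills a DP matrix with the alignment start fixed at the origin and then measures the range of differences between horizontally and vertically adjacent cells.

```python
import random


def fill_matrix(a, b, match=2, mismatch=-3, gap=5):
    """Fill a score matrix with the alignment start fixed at the origin.

    Assumption: linear gap penalty (a simplification of affine-gap SWG).
    """
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -gap * i          # leading gap in b
    for j in range(1, m + 1):
        S[0][j] = -gap * j          # leading gap in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,   # diagonal move
                          S[i - 1][j] - gap,     # vertical move
                          S[i][j - 1] - gap)     # horizontal move
    return S


def difference_range(S):
    """Return (min, max) over all horizontal and vertical adjacent-cell differences."""
    rows, cols = len(S), len(S[0])
    dh = [S[i][j] - S[i][j - 1] for i in range(rows) for j in range(1, cols)]
    dv = [S[i][j] - S[i - 1][j] for i in range(1, rows) for j in range(cols)]
    diffs = dh + dv
    return min(diffs), max(diffs)


if __name__ == "__main__":
    random.seed(0)
    a = "".join(random.choice("ACGT") for _ in range(300))
    b = "".join(random.choice("ACGT") for _ in range(300))
    S = fill_matrix(a, b)
    print("absolute score range:", min(map(min, S)), max(map(max, S)))
    print("difference range:", difference_range(S))
```

Under these assumptions every difference lies between -gap and match + gap regardless of sequence length, whereas the absolute scores grow with the length and soon exceed the 8-bit range. The sketch only demonstrates the bound; the recurrences that operate on the differences directly are the subject of the paper.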

Highlights

The read length of single-molecule DNA sequencers is reaching 1 Mb
The difference recurrences may be useful for nonbanded dynamic programming (DP) algorithms, but we excluded such algorithms because they run too slowly when input sequences are long, owing to their time complexity of O(n^2) in contrast to O(n) for adaptive banded DP algorithms, where n is the length of input sequences
To compare our algorithm with the fastest alignment algorithm for a unit score matrix, we implemented an alignment algorithm based on Myers’ bit-parallel edit distance algorithm with an adaptive band, which is a slightly modified version of the algorithm authored by Kimura [33]
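For readers unfamiliar with the bit-parallel baseline mentioned in the last highlight, the following is a minimal sketch of the Myers edit distance computation in the global-distance form described by Hyyrö. It is an illustration only: it computes the distance over the full matrix, does not include the adaptive band or traceback used in the comparison, and is not the code by Kimura [33]. The function name `myers_edit_distance` is hypothetical, and Python's arbitrary-precision integers stand in for the machine words a real implementation would use.

```python
def myers_edit_distance(pattern, text):
    """Unit-cost edit distance via bit-parallel vertical/horizontal deltas.

    Each bit i of pv/mv records whether the score increases/decreases by 1
    between rows i and i+1 of the current DP column.
    """
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1          # keep only the m bits that model the column
    top = 1 << (m - 1)           # bit for the last row of the column

    # peq[c] has bit i set where pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0             # first column: every vertical delta is +1
    score = m                    # D[m][0]

    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask    # horizontal delta +1
        mh = pv & xh                     # horizontal delta -1

        # The horizontal delta in the last row updates the running score.
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1

        # Shift in a +1 at row 0: the top boundary D[0][j] = j grows by 1
        # per column (global distance rather than substring search).
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


if __name__ == "__main__":
    print(myers_edit_distance("kitten", "sitting"))
```

Each input character is processed with a constant number of word operations per block of rows, which is why this approach is a strong baseline when the score matrix is a unit matrix.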

Summary

Introduction

The read length of single-molecule DNA sequencers is reaching 1 Mb. Popular alignment software tools widely used for analyzing such long reads often take advantage of single-instruction multiple-data (SIMD) operations to accelerate calculation of dynamic programming (DP) matrices in the Smith–Waterman–Gotoh (SWG) algorithm with a fixed alignment start position at the origin. Recent advances in single-molecule sequencers have enabled researchers to obtain much longer reads than those offered by Sanger sequencers. Since Pacific Biosciences released its first real-time single-molecule sequencer, PacBio RS, in 2010, the read length of single-molecule sequencers has been increasing. This trend calls for faster alignment algorithms that fully support long reads from single-molecule sequencers. BLASR [10], DALIGNER [11], and GraphMap [12] have shown a better balance among sensitivity, alignment quality, and computation time for long reads with abundant indels (insertions and deletions). The sensitivity and speed of the current alignment algorithms still need to be improved, especially for de novo assembly, which requires huge computation time for all-versus-all comparison of reads.

Methods

Results

Discussion

Conclusion