Small Alphabet Research Articles

Background: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common but computationally expensive task.

Results: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups of up to a factor of 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330.

Conclusion: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than $|\mathcal{A}|^m + m - 1$, where $m$ is the length of the PSSM and $\mathcal{A}$ is a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
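The abstract describes deriving a score threshold for a PSSM from a given p-value by dynamic programming. As a rough illustration of that idea, the sketch below computes the exact score distribution of a small integer-valued PSSM and reads off the smallest threshold whose tail probability does not exceed p. It evaluates the whole distribution eagerly, so it is not the lazy evaluation scheme the authors describe. The names `score_distribution`, `threshold_for_pvalue`, `EXAMPLE_PSSM` and `UNIFORM` are hypothetical, and the integer scores and uniform background are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical two-column nucleotide PSSM with integer scores (assumption).
EXAMPLE_PSSM = [
    {"A": 2, "C": 0, "G": -1, "T": -1},
    {"A": -1, "C": 3, "G": 0, "T": 0},
]
# Assumed background model: all four nucleotides equally likely.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def score_distribution(pssm, background):
    """Return {score: probability} for a random sequence of len(pssm)."""
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        # Extend every partial score by every symbol of the current column.
        for partial, prob in dist.items():
            for symbol, s in column.items():
                nxt[partial + s] += prob * background[symbol]
        dist = dict(nxt)
    return dist


def threshold_for_pvalue(pssm, background, p):
    """Smallest score T with P(score >= T) <= p, or None if none exists."""
    dist = score_distribution(pssm, background)
    tail = 0.0
    best = None
    # Accumulate probability mass from the highest score downwards.
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p:
            break
        best = score
    return best
```

With these example values, only the sequence AC reaches the maximum score of 5, which has probability 1/16, so `threshold_for_pvalue(EXAMPLE_PSSM, UNIFORM, 0.1)` returns 5, and a p-value below 1/16 yields `None`.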

Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows-Wheeler transform, has been shown to require memory which is nearly equal to the memory requirements of the original database, while supporting common sorts of query problems time efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used $(1 + 2\log_2|\Sigma|)(1+\epsilon)$ bits per character (where $\epsilon$ is a small fraction). The construction algorithm ran in as much as twice that space, in $O(|\Sigma|\, n \log n)$ time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses $\frac{1}{2}(1+|\Sigma|)(1+\epsilon)$ bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.
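To make the kind of query a Burrows-Wheeler based index supports more concrete, here is a minimal sketch of counting pattern occurrences by backward search. It builds the transform by naively sorting all suffixes and keeps an uncompressed occurrence table, so it illustrates only the query principle and not the space-efficient construction described in the abstract. The function names `bwt` and `count_occurrences` and the `$` sentinel are assumptions made for the example.

```python
def bwt(text):
    """Burrows-Wheeler transform via a naively sorted suffix array."""
    text += "$"  # sentinel, assumed to sort before every other symbol
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The symbol preceding each sorted suffix (wraps around for suffix 0).
    return "".join(text[i - 1] for i in suffix_array)


def count_occurrences(bwt_text, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    alphabet = sorted(set(bwt_text))
    # first[c]: number of symbols in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt_text.count(c)
    # occ[c][i]: number of occurrences of c in bwt_text[:i].
    occ = {c: [0] for c in alphabet}
    for ch in bwt_text:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    # Narrow the suffix array interval [lo, hi) one pattern symbol at a time,
    # reading the pattern from right to left.
    lo, hi = 0, len(bwt_text)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(bwt("ACGTACGACG"), "ACG")` returns 3. The occurrence table here costs one integer per symbol and text position, which is exactly the kind of overhead a compressed suffix array is designed to avoid.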

Articles published on Small Alphabet

Fast index based algorithms and software for matching position specific scoring matrices.

An improvement of the tree code construction

PAMA: A FAST STRING MATCHING ALGORITHM

Faster Algorithms for Computing Longest Common Increasing Subsequences

A Space-Efficient Construction of the Burrows–Wheeler Transform for Genomic Data

Tight approximability results for test set problems in bioinformatics

On average sequence complexity

Modified Chu sequences with smaller alphabet size

Fast prefix matching of bounded strings

Automated identification of RNA conformational motifs: theory and application to the HM LSU 23S rRNA.

Shift-or string matching with super-alphabets

High precision simulations of the longest common subsequence problem

(Coarse coding of shape fragments) + (retinotopy) ≈ representation of structure.

Linearly representable codes over chain rings

On the complexity measures of genetic sequences.

Polyphase Barker sequences up to length 45 with small alphabets

Algorithms for the longest common subsequence problem for multiple strings based on geometric maxima

Almost-perfect polyphase sequences with small phase alphabet

Random structures and evolution of biopolymers: A computational case study on RNA secondary structures

A subquadratic algorithm for approximate limited expression matching
