Abstract

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

Highlights

  • The problem of finding the occurrences of a pattern string in a given text is one of the most fundamental computational tasks in bioinformatics

  • We focus only on linear-time algorithms, and in particular on a recent algorithm called SA-IS proposed by Nong et al. [23, 24]

  • It turns out that exploiting this property leads to more efficient algorithms, as we describe in this article


Summary

Introduction

The problem of finding the occurrences of a pattern string in a given text is one of the most fundamental computational tasks in bioinformatics. One simple and effective data structure for this task is the suffix array, which informally is a list of the starting positions of the suffixes of the text, sorted in alphabetical order of the suffixes. A suffix array supports exact matching, but many biological applications need a simple form of inexact matching. For this it is possible to construct a modified suffix array that affords efficient search for all suffixes matching (a prefix of) a pattern such as ‘[ga]..c’, i.e. any occurrence of g or a followed by a c three positions later.
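To make the informal definition above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method described in the article: `build_suffix_array` uses a naive comparison sort rather than the linear-time SA-IS algorithm, and `build_seed_array` is a hypothetical, hard-coded treatment of the single seed ‘[ga]..c’ rather than DisLex. All function names are invented for this example.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort suffix start positions by suffix text.

    This is NOT SA-IS; it takes O(n^2 log n) time in the worst case.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of `pattern`, by binary search on `sa`."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def seed_key(text: str, i: int) -> Tuple[int, ...]:
    """Hypothetical sort key for the seed '[ga]..c' (assumed encoding).

    Offset 0 distinguishes only {g, a} from everything else, offsets 1 and 2
    are ignored, and offset 3 is compared exactly.
    """
    key = []
    for offset, ch in enumerate(text[i:i + 4]):
        if offset == 0:
            key.append(0 if ch in "ga" else 1)
        elif offset == 3:
            key.append(ord(ch))
        else:
            key.append(0)
    return tuple(key)


def build_seed_array(text: str) -> List[int]:
    """Sort suffix positions by seed_key, breaking ties by position."""
    return sorted(range(len(text)), key=lambda i: (seed_key(text, i), i))


def find_seed_matches(text: str, seed_sa: List[int]) -> List[int]:
    """Positions matching '[ga]..c'; they form one contiguous run in seed_sa."""
    keys = [seed_key(text, i) for i in seed_sa]
    target = (0, 0, 0, ord("c"))
    return sorted(seed_sa[bisect_left(keys, target):bisect_right(keys, target)])
```

In both arrays the matching suffixes occupy a contiguous range, which is what allows binary search to locate them. The modified ordering achieves this for the inexact pattern by treating g and a as equal at the first position and by ignoring the two positions that follow.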

